I'm working on a system that processes a bunch of email, saves it all in a DB, and then allows web users to view and reply to the messages (this is a custom call center application). Why this is tricky for me:

  1. The messages we receive are sent in various languages and encodings, both Western and East Asian.
  2. We have an extensive knowledge base (KB) of canned answers to questions. We don't mind duplicating these in each language we support, but we'd rather not have them in multiple encodings as well. We currently have all the KB entries stored as UTF-8.

The difficulty is that when I compose a message reply, we want to include the original message and allow insertion of KB entries. If I set the encoding of the page that displays the reply to UTF-8, we can read the KB entries as they're inserted, but some of the original message is garbled (especially if it is in gb2312 or big5). If I instead serve the page in the charset of the original message, the KB entries become unreadable.

The solution seems to be to convert either the original message or the KB entries used into the other charset. And that's my question: how do I do this? I've poked around CPAN, and the most likely suspect (Unicode::MapUTF8) does not support the charsets I need. I've also tried forking an iconv(1) to do the conversion, but it just aborts when it encounters a byte sequence it doesn't recognize, which is apparently pretty often with our data.
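To make the goal concrete, here is a minimal sketch of the conversion described above, written in Python purely for illustration (the post itself concerns Perl and iconv, and nothing here is a claim about any Perl module). The function names `to_utf8` and `compose_reply` are hypothetical, and the fallback to Latin-1 for an unknown charset label is an assumption rather than a requirement from the post. The point it shows is that undecodable byte sequences are replaced with U+FFFD instead of aborting the whole conversion.

```python
def to_utf8(raw: bytes, charset: str) -> bytes:
    """Convert bytes in the declared charset to UTF-8 without aborting.

    Byte sequences that are invalid in `charset` are replaced with
    U+FFFD (the Unicode replacement character) rather than raising.
    """
    try:
        text = raw.decode(charset, errors="replace")
    except LookupError:
        # Assumption: if the charset label is unknown, fall back to
        # Latin-1, which maps every byte to some character.
        text = raw.decode("latin-1")
    return text.encode("utf-8")


def compose_reply(original: bytes, charset: str, kb_entry_utf8: bytes) -> bytes:
    """Build a UTF-8 reply body: the KB entry followed by the quoted original."""
    original_text = to_utf8(original, charset).decode("utf-8")
    quoted = "\n".join("> " + line for line in original_text.splitlines())
    return kb_entry_utf8 + b"\n\n" + quoted.encode("utf-8")


if __name__ == "__main__":
    message = "中文".encode("gb2312")           # original message in gb2312
    kb = "Thank you for your message.".encode("utf-8")
    print(compose_reply(message, "gb2312", kb).decode("utf-8"))
```

With everything normalized to UTF-8 this way, the reply page can be served as UTF-8 and both the quoted original and the KB entries stay readable; only the damaged byte sequences show up as replacement characters.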

So does anyone have any wisdom or experience with charset conversions that they care to share?

Thanks, and happy holidays!

--roundboy


In reply to Unicode & charset conversions - how? by roundboy
