ottaky has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I have a number of text files from a previous project that include Chinese, Japanese, Arabic etc. text encoded in character sets like gb2312, s-jis and the like.

I now need to take that text and convert it into utf8.

Looking on CPAN I've found various Unicode:: modules, but they don't seem to like 2-byte character encodings.

Can anybody suggest a method to convert from one character set to utf8 in a fairly bombproof way?

Thanks!

Re: Character set conversions
by Abigail-II (Bishop) on Nov 11, 2003 at 16:29 UTC
    Did you look at the Encode module(s)?

    Abigail

      Actually, I didn't - but I am now ;-)
Re: Character set conversions
by allolex (Curate) on Nov 11, 2003 at 16:49 UTC
      Aha - now you're talking. I'd never even heard of iconv before. Cheers!
Re: Character set conversions
by graff (Chancellor) on Nov 12, 2003 at 01:52 UTC
    Between the 5.8.1 Encode module and the various conversion tools available from other sources, I would actually tend to prefer the former, especially for Arabic. Last time I tried iconv to go from (e.g.) iso-8859-6 to utf8, it took the liberty of converting ASCII digits to "Arabic script" digits, which is neither justified nor desirable. (The original 5.8.0 encoding table did the same thing, but they fixed it in 5.8.1. For that matter, maybe iconv has been fixed since then as well. I have seen a few different versions of iconv in operation, and their character-set inventories seemed oddly different.)

    My main point is, be careful when doing character-encoding conversion on any non-European language; command-line utils (iconv, etc.) may perform some replacements that are inappropriate, and yield more "?" (no-such-character) outputs than you would expect, and sometimes this will be due to unexpected properties of the input data.

    Encode.pm might do the same in some cases, when left to its default behavior, but at least you have the ability to change its behavior, and you can create and use alternate character mapping tables if necessary. (Check out "perldoc enc2xs".)
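
    As a rough sketch of what "changing its behavior" could look like: passing Encode::FB_CROAK as the CHECK argument makes decode() and encode() die on input they cannot map, instead of quietly putting in a substitution character. The helper name to_utf8 and the sample bytes below are made up for illustration, so treat this as a starting point rather than a tested recipe for your data.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not part of Encode): convert a byte string from a
# legacy encoding to UTF-8, dying on unmappable input instead of substituting.
sub to_utf8 {
    my ($bytes, $from) = @_;

    # FB_CROAK: die on malformed or unmappable input.
    # LEAVE_SRC: do not modify the source buffer while checking.
    my $chars = decode($from, $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);

    # Encode the resulting character string as strict UTF-8 bytes.
    return encode('UTF-8', $chars, Encode::FB_CROAK);
}

# Sample input: the Shift-JIS bytes for the two characters U+65E5 U+672C.
my $sjis = "\x93\xFA\x96\x7B";
my $utf8 = to_utf8($sjis, 'shiftjis');

# Show the converted bytes in hex: E6 97 A5 E6 9C AC
printf "%s\n", join ' ', map { sprintf '%02X', ord } split //, $utf8;

# Bad input is reported rather than silently replaced.
my $ok = eval { to_utf8("\xE9", 'ascii'); 1 };
print "conversion failed: $@" unless $ok;
```

    For whole files, the same encodings can also be applied through a PerlIO layer such as '<:encoding(shiftjis)' on open, with the output handle opened as '>:encoding(UTF-8)'.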