Character set conversions

ottaky has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I have a number of text files from a previous project that include Chinese, Japanese, Arabic, etc. text encoded in character sets like GB2312, Shift-JIS and the like.

I now need to take that text and convert it into utf8.

Looking on CPAN I've found various Unicode:: modules, but they don't seem to like 2-byte character encodings.

Can anybody suggest a method to convert from one character set to utf8 in a fairly bombproof way?

Thanks!

Replies are listed 'Best First'.
Re: Character set conversions by Abigail-II (Bishop) on Nov 11, 2003 at 16:29 UTC
Did you look at the Encode module(s)? Abigail
Re: Re: Character set conversions by ottaky (Novice) on Nov 11, 2003 at 16:53 UTC
Actually, I didn't - but I am now ;-)
Re: Character set conversions by allolex (Curate) on Nov 11, 2003 at 16:49 UTC
You might want to Super Search for "recode" and "iconv", especially if you don't necessarily need a Perlish solution to the recoding. -- Allolex Perl and Linguistics http://world.std.com/~swmcd/steven/perl/linguistics.html http://www.linuxjournal.com/article.php?sid=3394 http://www.wall.org/~larry/keynote/keynote.html
Re: Re: Character set conversions by ottaky (Novice) on Nov 11, 2003 at 17:23 UTC
Aha - now you're talking. I'd never even heard of iconv before. Cheers!
Re: Character set conversions by graff (Chancellor) on Nov 12, 2003 at 01:52 UTC
Between the 5.8.1 Encode module and the various conversion tools available from other sources, I would actually tend to prefer the former, especially for Arabic. Last time I tried iconv to go from (e.g.) iso-8859-6 to utf8, it took the liberty of converting ASCII digits to "Arabic script" digits, where this is neither justified nor desirable. (The original 5.8.0 encoding table did the same thing, but they fixed it in 5.8.1. For that matter, maybe iconv has been fixed since then as well -- I have actually seen a few different versions of iconv in operation, whose character-set inventories seemed oddly different.)

My main point is, be careful when doing character-encoding conversion on any non-European language; command-line utils (iconv, etc.) may perform some replacements that are inappropriate, and yield more "?" (no-such-character) outputs than you would expect -- and sometimes this will be due to unexpected properties of the input data. Encode.pm might do the same in some cases, when left to its default behavior, but at least you have the ability to change its behavior, and you can create and use alternate character mapping tables if necessary. (Check out "perldoc enc2xs".)
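For anyone landing here later, a minimal sketch of the Encode approach discussed above might look like the following. The helper name to_utf8 and the sample byte strings are made up for illustration; decode, encode and FB_CROAK come from the Encode module. Passing FB_CROAK is one way of overriding the default behavior: decode dies on input it cannot map rather than quietly substituting a replacement character. Encoding names such as 'shiftjis' and 'iso-8859-6' are assumed to be available in your Encode installation ("perldoc Encode::Supported" lists them).

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: take raw bytes in $from_charset, return UTF-8 bytes.
# FB_CROAK makes decode() die on unmappable input instead of substituting.
sub to_utf8 {
    my ($bytes, $from_charset) = @_;
    # $bytes is a private copy, so decode() may safely modify it.
    my $text = decode($from_charset, $bytes, Encode::FB_CROAK);
    return encode('UTF-8', $text);
}

# Render a byte string as space-separated hex for inspection.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# Two kanji (U+65E5 U+672C) as Shift-JIS bytes.
my $from_sjis = to_utf8("\x93\xFA\x96\x7B", 'shiftjis');

# Arabic letter alef (0xC7 in iso-8859-6) followed by ASCII digits.
my $from_arabic = to_utf8("\xC7" . "123", 'iso-8859-6');

print hex_dump($from_sjis),   "\n";
print hex_dump($from_arabic), "\n";
```

Wrapping the call in eval { ... } lets a batch job log the offending file and carry on rather than aborting. For whole files, the same decode and encode steps can be applied line by line, or the file can be opened with an ":encoding(...)" layer.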