rubenstein has asked for the wisdom of the Perl Monks concerning the following question:

I have been using Perl's Encode module to convert text from Unicode into various legacy encodings. (I realize that might seem backward but, nonetheless...) For the most part this has worked fine (for instance, creating Arabic cp1256 documents from a text composed in utf-8). I am having major problems, however, when I try to convert into the single-byte Vietnamese encodings known as VISCII and cp1258.

The first problem is that characters which should convert smoothly do not. For instance, a message comes back that "\x{1ead}" ("ậ", that is, "LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW") does not map to cp1258. It should map, however, to the two bytes 0xE2 0xF2 (precomposed "â" followed by cp1258's combining dot below, U+0323). It seems that the Encode::Byte module, which claims to handle cp1258 conversion, can't handle these complex Vietnamese characters (which are quite common).
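To illustrate, decoding those two bytes back through Encode's own cp1258 table shows the representation I expect (a minimal check using only core Encode):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# cp1258 stores "ậ" as precomposed "â" (0xE2) followed by
# COMBINING DOT BELOW (0xF2, i.e. U+0323)
my $chars = decode('cp1258', "\xE2\xF2");
printf "U+%04X U+%04X\n", map ord, split //, $chars;    # prints "U+00E2 U+0323"

# and that two-character sequence round-trips back to the same bytes
print encode('cp1258', "\x{E2}\x{323}") eq "\xE2\xF2" ? "ok\n" : "not ok\n";    # prints "ok"
```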

The second problem is that, after having done a piece of the conversion, the process totally crashes with this message:

panic: sv_setpvn called with negative strlen at c:\Perl\lib\convert.pl line 52, <IN> line 838.

Line 52 is the line where I print through the filehandle OUT (through the layer appropriate to the encoding in question - cp1258 here):

use Encode;
use Encode::Byte;
open(IN,  "<:encoding($enc)",      $infile) or die $!;  # assume $enc      = utf8
open(OUT, ">:encoding($dest_enc)", $target) or die $!;  # assume $dest_enc = cp1258
while (my $conv = <IN>) {
    print OUT $conv;
}
I really have no idea what the "panic" message means. But beyond simply not encoding the characters, the effect of the error is to stop the process of reading lines in and printing them out.

Does anyone have expertise in the Encode module who can help me here? Alternatively, does anyone know of another means of converting text into Vietnamese legacy encodings? I have already worked with (an implementation based on) iconv and found it unsatisfactory.

Replies are listed 'Best First'.
Re: The Encode::Byte module
by graff (Chancellor) on Jul 12, 2005 at 04:13 UTC
    The problem here is that the Encode module is limited to the very basic case of one Unicode character mapping to one non-Unicode character and vice versa. Encode is not capable of mapping a single Unicode character to a sequence of two non-Unicode characters.
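    You can see the limitation directly. With the default fallback Encode just substitutes "?" and warns; passing FB_CROAK turns the failed mapping into a hard error (a minimal sketch):

```perl
use strict;
use warnings;
use Encode qw(encode FB_CROAK);

# "\x{1EAD}" would need two cp1258 bytes, which the one-to-one
# table cannot produce; FB_CROAK makes the failure explicit
my $mapped = eval { encode('cp1258', "\x{1EAD}", FB_CROAK); 1 };
print $mapped ? "mapped\n" : "does not map to cp1258\n";    # prints "does not map to cp1258"
```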

    One approach you could use is to pre-process the unicode Vietnamese text in order to "decompose" the characters in the range "\x{1EA0}" - "\x{1EF9}" (the "Latin Extensions for Vietnamese") into the components that actually exist in the cp1258 code page. At that point, it should be possible to do a simple "encode('cp1258', $decomp_string)".
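    For a single character the idea looks like this (a sketch assuming the core Unicode::Normalize module; a real filter would handle all six shaped vowels and their capitals, not just "a"):

```perl
use strict;
use warnings;
use Encode qw(encode);
use Unicode::Normalize qw(NFD);

# full decomposition gives a + U+0323 + U+0302, but cp1258 has no
# combining circumflex, so fold the circumflex back into a
# precomposed base letter before encoding
my $s = NFD("\x{1EAD}");                  # "a\x{323}\x{302}"
$s =~ s/a(\x{323}?)\x{302}/\x{E2}$1/g;    # a (+ dot below) + circumflex => â (+ dot below)
my $bytes = encode('cp1258', $s);
printf "%vX\n", $bytes;                   # prints "E2.F2"
```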

    It does seem a bit disingenuous to include cp1258 among the legacy encodings "supported" by the Encode module, given that unicode text data that uses the "Latin extensions for Vietnamese" will often be impossible to encode using this mapping (and decoding cp1258 text into utf8 unicode will be "defective" because it won't produce the complex characters in the Vietnamese extension set).

    If you pull down a copy of the unicode database, you can parse that to derive the decompositions for each of the characters in the "extensions for Vietnamese" range, and with that, build a simple filter that will replace each "complex" unicode character with its two-character form that should make it manageable with "encode('cp1258',...)". Here's a start:

    #!/usr/bin/perl -CS
    use strict;
    use warnings;
    use Unicode::Normalize 'getComposite';

    open(DB, '<', 'UnicodeData.txt') or die "UnicodeData.txt: $!";
    my @lines = grep /^1E[A-F].;/, <DB>;
    close DB;

    # load the decomp hash: keys are unicode Vietnamese,
    # values are their canonical decompositions
    my %decomp;
    for ( @lines ) {
        my ($u, $d) = (split /;/, $_)[0, 5];
        my $uc = chr( hex( $u ));
        my $cc = join '', map { chr( hex( $_ )) } split / /, $d;
        $decomp{$uc} = $cc;
    }
    my $todecomp = join '', keys %decomp;

    # one level of decomposition can still leave characters that are
    # not in cp1258 (e.g. "\x{1EAD}" -> "\x{1EA1}\x{302}"), so keep
    # expanding until only base letters and combining marks remain ...
    for my $uc ( keys %decomp ) {
        1 while $decomp{$uc} =~ s/([$todecomp])/$decomp{$1}/;
    }

    # ... then fold circumflex/breve/horn back onto the base letter,
    # because cp1258 has those shapes only in precomposed form (the
    # five tone marks are the combining characters it does keep)
    for my $uc ( keys %decomp ) {
        $decomp{$uc} =~ s{(\P{M})(\x{323}?)([\x{302}\x{306}\x{31B}])}
                         {chr( getComposite( ord $1, ord $3 )) . $2}ge;
    }

    # now apply decomposition to data:
    while (<>) {
        s/([$todecomp])/$decomp{$1}/g;
        print;
    }
    You can either pipe that to another script that will use Encode, or else just add "use Encode;" at the top and call "encode('cp1258', $_)" before the "print" in the while loop.

    (Update: I made the "grep" more selective on the UnicodeData.txt file, instead of just "1E..". Also, if you choose to add the "use Encode" and "encode('cp1258',$_)" to that script, be sure to change the flag on the shebang line from "-CS" to "-CI", because STDOUT won't be unicode data in that case.)