The problem here is that the Encode module only handles the basic case where one Unicode character maps to one non-Unicode character, and vice versa. Encode cannot map a single Unicode character to a sequence of two non-Unicode characters.

One approach is to pre-process the Unicode Vietnamese text to "decompose" the characters in the range "\x{1EA0}" - "\x{1EF9}" (the "Latin Extensions for Vietnamese") into components that actually exist in the cp1258 code page. At that point, a simple "encode('cp1258', $decomp_string)" should work.
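To make the difference concrete, here is a minimal sketch that tries to encode one precomposed character and its two-character decomposition. The helper name try_cp1258 is made up for this example, and the outcome assumes that Encode's cp1258 table contains the combining dot below but not U+1EA0, which is what this thread suggests; check it on your own installation.

```perl
use strict;
use warnings;
use Encode qw(encode FB_CROAK LEAVE_SRC);

my $precomposed = "\x{1EA0}";     # LATIN CAPITAL LETTER A WITH DOT BELOW
my $decomposed  = "A\x{0323}";    # 'A' followed by COMBINING DOT BELOW

# Hypothetical helper: returns the cp1258 bytes, or undef when any
# character in $text has no mapping (FB_CROAK makes encode() die,
# LEAVE_SRC keeps encode() from modifying its argument).
sub try_cp1258 {
    my ($text) = @_;
    return eval { encode('cp1258', $text, FB_CROAK | LEAVE_SRC) };
}

for my $pair ( [ precomposed => $precomposed ], [ decomposed => $decomposed ] ) {
    my ($label, $text) = @$pair;
    my $bytes = try_cp1258($text);
    printf "%-12s %s\n", "$label:",
        defined($bytes) ? 'encoded to ' . length($bytes) . ' byte(s)' : 'not encodable';
}
```

Without FB_CROAK, encode() quietly substitutes a replacement character for anything it cannot map, so asking it to croak is a handy way to find out whether the pre-processing step missed something.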

It does seem a bit disingenuous to list cp1258 among the legacy encodings "supported" by the Encode module, given that Unicode text that uses the "Latin extensions for Vietnamese" will often be impossible to encode with this mapping. Decoding cp1258 text into Unicode is "defective" in the same way, because it won't produce the complex characters in the Vietnamese extension set.
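For the decoding direction, one possible workaround (not something Encode does for you) is to run the decoded text through NFC from the core Unicode::Normalize module, which recombines a base letter and its combining marks into the precomposed character where one exists. A minimal sketch, with decode_cp1258_composed being a name invented for this example:

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use Unicode::Normalize qw(NFC);

# Hypothetical helper: decode cp1258 bytes, then recompose with NFC so
# that base letter + tone mark pairs become single precomposed characters.
sub decode_cp1258_composed {
    my ($bytes) = @_;
    return NFC( decode('cp1258', $bytes) );
}

# Build sample cp1258 data from 'A' + COMBINING DOT BELOW.
my $bytes    = encode('cp1258', "A\x{0323}");
my $plain    = decode('cp1258', $bytes);
my $composed = decode_cp1258_composed($bytes);

printf "plain decode: %d character(s)\n", length $plain;
printf "after NFC:    %d character(s)\n", length $composed;
```

Note that NFC normalizes the whole string, not just the Vietnamese characters, so it may not be what you want if other parts of the data must be preserved exactly.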

If you pull down a copy of the Unicode database (UnicodeData.txt), you can parse it to derive the decomposition of each character in the "extensions for Vietnamese" range. With that, you can build a simple filter that replaces each "complex" Unicode character with its two-character form, which should make the text manageable with "encode('cp1258',...)". Here's a start:

#!/usr/bin/perl -CS

use strict;
use warnings;

# keep only the rows for the "Latin extensions for Vietnamese" range
open(my $db, '<', 'UnicodeData.txt') or die "UnicodeData.txt: $!";
my @lines = grep /^1E[A-F].;/, <$db>;
close $db;

# load the decomp hash: keys are unicode Vietnamese,
# values are two-character decompositions

my %decomp;
for ( @lines ) {
    my ($u,$d) = (split /;/, $_ )[0,5];
    # skip code points with no decomposition, so they are not
    # replaced by an empty string in the loop below
    next unless length $d;
    my $uc = chr( hex( $u ));
    my $cc = join '', map { chr( hex( $_ )) } split / /,$d;
    $decomp{$uc} = $cc;
}

my $todecomp = join '', keys %decomp;
# now apply decomposition to data:

while (<>) {
    s/([$todecomp])/$decomp{$1}/g;
    print;
}

You can either pipe that to another script that uses Encode, or else just add "use Encode" above and do "$_ = encode('cp1258',$_)" before the "print" in the while loop.

(Update: I made the "grep" on the UnicodeData.txt file more selective, instead of just "1E..". Also, if you choose to add the "use Encode" and "encode('cp1258',$_)" to that script, be sure to change the flag on the shebang line from "-CS" to "-CI", because STDOUT will carry cp1258 bytes rather than Unicode data in that case.)
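One caveat about the letters that carry two diacritics: as far as I recall, UnicodeData.txt lists only a single decomposition step per character, so U+1EAC (A with circumflex and dot below) is given as "1EA0 0302", and cp1258 has no combining circumflex. It is worth checking your copy of the file. A possible alternative, sketched below, swaps the UnicodeData.txt lookup for the core Unicode::Normalize module: decompose fully with NFD, set aside the tone marks, and recompose what is left with NFC. The function names and the list of five tone marks are assumptions made for this example, not part of the script above.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC);

# Assumption: these are the five tone marks that cp1258 provides as
# combining characters (grave, acute, tilde, hook above, dot below).
my $TONE = qr/[\x{0300}\x{0301}\x{0303}\x{0309}\x{0323}]/;

# Hypothetical helper: turn one precomposed Vietnamese character into
# "base letter (keeping circumflex, breve or horn) + tone mark(s)".
sub to_cp1258_form {
    my ($char) = @_;
    my $nfd   = NFD($char);                    # full canonical decomposition
    my $tones = join '', $nfd =~ /($TONE)/g;   # collect the tone marks
    (my $base = $nfd) =~ s/$TONE//g;           # what remains is letter + shape mark
    return NFC($base) . $tones;                # recompose the letter, append tones
}

# Apply the helper only to the "Latin extensions for Vietnamese" range.
sub decompose_vietnamese {
    my ($text) = @_;
    $text =~ s/([\x{1EA0}-\x{1EF9}])/to_cp1258_form($1)/ge;
    return $text;
}

binmode STDOUT, ':encoding(UTF-8)';
print decompose_vietnamese("Vi\x{1EC7}t Nam"), "\n";
```

The result can then be handed to "encode('cp1258',...)" in the same way as the output of the filter above.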

In reply to Re: The Encode::Byte module by graff
in thread The Encode::Byte module by rubenstein
