One approach is to pre-process the Unicode Vietnamese text so as to "decompose" the characters in the range "\x{1EA0}" - "\x{1EF9}" (the "Latin Extensions for Vietnamese") into components that actually exist in the cp1258 code page. At that point, a simple "encode('cp1258', $decomp_string)" should work.
It does seem a bit disingenuous to list cp1258 among the legacy encodings "supported" by the Encode module: Unicode text that uses the "Latin extensions for Vietnamese" will often be impossible to encode with this mapping, and decoding cp1258 text into Unicode will be "defective" in the sense that it never produces the precomposed characters of the Vietnamese extension set, only base letters followed by combining marks.
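To make the problem concrete, here is a minimal sketch that asks Encode for a strict conversion of one precomposed character and of its two-character equivalent. The helper name "try_cp1258" is made up for illustration; the FB_CROAK and LEAVE_SRC flags are the standard Encode check constants.

```perl
use strict;
use warnings;
use Encode qw(encode decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: strict encode to cp1258. Returns the bytes, or
# undef if any character has no mapping in the code page.
sub try_cp1258 {
    my ($str) = @_;
    return eval { encode('cp1258', $str, FB_CROAK() | LEAVE_SRC()) };
}

my $precomposed = "\x{1EA1}";     # LATIN SMALL LETTER A WITH DOT BELOW
my $decomposed  = "a\x{0323}";    # "a" + COMBINING DOT BELOW

my $fail  = try_cp1258($precomposed);
my $bytes = try_cp1258($decomposed);

# Decoding gives back the two-character sequence, not U+1EA1.
my $round_trip = decode('cp1258', $bytes);

printf "precomposed encodes: %s\n", defined $fail  ? 'yes' : 'no';
printf "decomposed encodes:  %s\n", defined $bytes ? 'yes' : 'no';
printf "decoded length: %d characters\n", length $round_trip;
```

If you want the precomposed form back after decoding, running the result through NFC from Unicode::Normalize should do it.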
If you pull down a copy of the Unicode database (UnicodeData.txt), you can parse it to get the decomposition of each character in the "extensions for Vietnamese" range, and with that build a simple filter that replaces each "complex" Unicode character with its two-character form, which "encode('cp1258',...)" should then be able to handle. Here's a start:
```perl
#!/usr/bin/perl -CS
use strict;
use warnings;

open( my $db, '<', 'UnicodeData.txt' ) or die "UnicodeData.txt: $!";
my @lines = grep /^1E[A-F].;/, <$db>;
close $db;

# load the decomp hash: keys are unicode Vietnamese,
# values are two-character decompositions
my %decomp;
for ( @lines ) {
    my ( $u, $d ) = ( split /;/, $_ )[0,5];
    next unless length $d;    # skip code points that have no decomposition
    my $uc = chr( hex( $u ));
    my $cc = join '', map { chr( hex( $_ )) } split / /, $d;
    $decomp{$uc} = $cc;
}
my $todecomp = join '', keys %decomp;

# now apply decomposition to data:
while (<>) {
    s/([$todecomp])/$decomp{$1}/g;
    print;
}
```

You can either pipe that to another script that uses Encode, or else just add "use Encode" above and do "$_ = encode('cp1258',$_)" before the "print" in the while loop.
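One case a single-level lookup does not cover: letters that carry two marks. For example, UnicodeData.txt gives the decomposition of "\x{1EAD}" as "1EA1 0302", and "\x{1EA1}" is itself outside cp1258; as far as I can tell the code page also has no combining circumflex, breve or horn, only the five tone marks. Below is a hedged sketch of one way around that, using the core Unicode::Normalize module instead of parsing UnicodeData.txt. The helper names ("encodable", "cp1258_form", "cp1258_ready") are invented for this example. It fully decomposes a character, fuses the base letter with any circumflex/breve/horn, and leaves the tone mark as a combining character. Because it tests every non-ASCII character, it should also catch letters outside the range above, such as "\x{00E3}" (a with tilde), that appear to have no single-byte slot in cp1258.

```perl
use strict;
use warnings;
use Encode qw(encode FB_CROAK LEAVE_SRC);
use Unicode::Normalize qw(NFC NFD);

# Marks that (by assumption) cp1258 only offers fused with a base
# letter: circumflex, breve, horn.
my $shape_mark = qr/[\x{0302}\x{0306}\x{031B}]/;

# True if every character of $str has a mapping in cp1258.
sub encodable {
    my ($str) = @_;
    return defined eval { encode('cp1258', $str, FB_CROAK() | LEAVE_SRC()) };
}

# Hypothetical helper: rewrite one character as "base letter with shape
# mark" followed by any remaining combining tone marks.
sub cp1258_form {
    my ($char) = @_;
    return $char if encodable($char);
    my ($base, @marks) = split //, NFD($char);
    my @shape = grep { $_ =~ $shape_mark } @marks;
    my @tone  = grep { $_ !~ $shape_mark } @marks;
    my $form  = NFC( join '', $base, @shape ) . join '', @tone;
    return encodable($form) ? $form : $char;   # give up: leave unchanged
}

# Hypothetical helper: apply cp1258_form to every non-ASCII character.
sub cp1258_ready {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/cp1258_form($1)/ge;
    return $text;
}

my $ready = cp1258_ready("Vi\x{1EC7}t");    # "Viet" with e-circumflex-dot-below
my $bytes = encode('cp1258', $ready, FB_CROAK() | LEAVE_SRC());
printf "%d characters -> %d bytes\n", length $ready, length $bytes;
```

Calling eval once per character is slow on large inputs; caching the result of cp1258_form in a hash would be the obvious next step.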
(Update: I made the "grep" more selective on the UnicodeData.txt file, instead of just "1E..". Also, if you choose to add the "use Encode" and "encode('cp1258',$_)" to that script, be sure to change the flag on the shebang line from "-CS" to "-CI", because STDOUT won't be unicode data in that case.)
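The reason for the flag change can be seen in a small sketch. It writes the output of "encode" through a raw handle and through a ":utf8" handle (roughly what "-CS" does to STDOUT), using in-memory files so the byte counts can be compared. The variable names are just for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# encode() returns a byte string: "a" plus the single cp1258 byte
# for COMBINING DOT BELOW.
my $bytes = encode('cp1258', "a\x{0323}");

# Write the same bytes through a raw handle and through a :utf8 handle,
# using in-memory files so the result can be inspected.
my ($raw_out, $utf8_out) = ('', '');
open( my $raw,  '>:raw',  \$raw_out )  or die "in-memory open: $!";
open( my $utf8, '>:utf8', \$utf8_out ) or die "in-memory open: $!";
print {$raw}  $bytes;
print {$utf8} $bytes;
close $raw;
close $utf8;

printf "raw handle:   %d bytes\n", length $raw_out;
printf ":utf8 handle: %d bytes\n", length $utf8_out;
```

The ":utf8" handle re-encodes the high byte as a two-byte UTF-8 sequence, so the output is no longer valid cp1258.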
In reply to Re: The Encode::Byte module
by graff
in thread The Encode::Byte module
by rubenstein