Dear Monks,
do you remember how the old printers dealt with character sets?
They used ESCape sequences to switch between tables, which gave them access to more characters than would otherwise have been possible.
Now I am working with records that use the same technique.
For example, in the default character set the byte sequence
\xE0\x65 may represent one character.
But once the ESC sequence \x1B\x49 has been encountered, we are supposed to look in another character set, and from there on
\xE0\x65 may denote a different character.
I want to be able to convert these records to UTF-8.
I have been thinking of two ways to deal with the changing character sets.
1. To use
$line =~ s/$regexp|$ESC/$CURRENTCHARSET->{$&} or change_charset/ge;
Here I try to catch any ESC sequences. If one is encountered, I simply replace the ESC sequence itself with an empty string and change the CURRENT CHARSET to the appropriate one.
Here is some code that deals with an utterly simplified case. It seems to work, even if it is probably not very efficient. (In my case I don't think efficiency is an issue.)
#!/usr/bin/perl -w
use strict;
use diagnostics;

# Mappings between character sets
my %M_2_UTF8  = ( 'A' => 'X', );    # default set
my %Ma_2_UTF8 = ( 'A' => 'a', );    # set a
my $CURRENTSET = \%M_2_UTF8;
my @setqueue   = ();                # a stack where we push and pop

# ESC-sequences
my $switch_to_set_a = "\x1Ba";
my $reset           = "\x1Bs";      # return to previous charset
my %ESC = (
    $switch_to_set_a => sub {
        push(@setqueue, $CURRENTSET);
        $CURRENTSET = \%Ma_2_UTF8;
        return '';
    },
    $reset => sub {
        # fall back to the default set if there is nothing to pop
        $CURRENTSET = @setqueue ? pop @setqueue : \%M_2_UTF8;
        return '';
    },
);

my $str     = 'X|a|X';
my $esc_str = qq(A|${switch_to_set_a}A|${reset}A);
my $convstr = fix($esc_str);
if ($str eq $convstr) {
    print "ok\n";
}
else {
    print "didn't work...\n";
}

# Replace every ESC sequence and every other character in one pass
sub fix {
    my $s = shift;
    $s =~ s{(\x1B[abcs])|(.)}
           {repl($&)}ge;
    return $s;
}

# ESC sequences switch the current set and vanish; anything else is
# looked up in the current set and passed through if it is unmapped
sub repl {
    my $match = shift;
    if (exists $ESC{$match}) {
        return $ESC{$match}();
    }
    elsif (exists $CURRENTSET->{$match}) {
        return $CURRENTSET->{$match};
    }
    else {
        return $match;
    }
}
__END__
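The simplified case above only handles single-character keys. For multi-byte sequences such as \xE0\x65, the $regexp of approach 1 could perhaps be generated from the keys of the mapping tables, longest first, so that a two-byte sequence is tried before a one-byte prefix of it. The following is only a sketch under assumptions: the target characters, the set name 'alt', the reset sequence \x1B\x73 and the names %charsets, %switch and convert are all invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical tables: the byte sequences come from the example in the
# text, the target characters are invented for illustration.
my %charsets = (
    default => { "\xE0\x65" => "\x{00E9}", "\xE0" => "\x{0300}" },
    alt     => { "\xE0\x65" => "\x{0435}" },
);
my %switch = ( "\x1B\x49" => 'alt' );   # ESC sequence => name of the set
my $reset  = "\x1B\x73";                # assumed "back to previous set"

# One alternation over all ESC sequences and all mapped byte sequences,
# sorted longest first so "\xE0\x65" is tried before a bare "\xE0".
my %seen;
my @tokens = grep { !$seen{$_}++ }
             ( (keys %switch), $reset, (map { keys %$_ } values %charsets) );
my $alternation = join '|',
                  map  { quotemeta($_) }
                  sort { length($b) <=> length($a) } @tokens;
my $token_re = qr/$alternation/;

sub convert {
    my ($bytes) = @_;
    my $current = 'default';
    my @stack;

    # Decide what a single matched token turns into
    my $translate = sub {
        my ($t) = @_;
        if (exists $switch{$t}) {           # switch set, drop the ESC
            push @stack, $current;
            $current = $switch{$t};
            return '';
        }
        if ($t eq $reset) {                 # back to the previous set
            $current = @stack ? pop @stack : 'default';
            return '';
        }
        return exists $charsets{$current}{$t}
             ? $charsets{$current}{$t}
             : $t;                          # unmapped in this set: keep
    };

    $bytes =~ s{($token_re)}{ $translate->($1) }ge;
    return $bytes;
}
```

The result is a Perl character string; something like Encode::encode('UTF-8', convert($record)) would still be needed before writing it out. Adding or removing a mapping then only means editing %charsets, since the regex is rebuilt from the tables.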
2. My second thought is to use the regexp feature
(?{code}) to swap between charsets. But here I am entirely at a loss as to how to implement the idea.
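To make the idea a little more concrete, here is a rough sketch of what a (?{code}) version of the simplified case might look like. It is an illustration under assumptions rather than a tested solution: the tables and the stack are package variables (our) because closing over lexicals inside code blocks was unreliable before Perl 5.18, and the name fix_with_code_blocks is made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Package variables, so the (?{...}) blocks do not close over lexicals
our %M_2_UTF8  = ( 'A' => 'X' );    # default set
our %Ma_2_UTF8 = ( 'A' => 'a' );    # set a
our ($CURRENT, @STACK);

sub fix_with_code_blocks {
    my ($s) = @_;
    local $CURRENT = \%M_2_UTF8;
    local @STACK   = ();

    # Each ESC alternative ends in a code block that switches the set.
    # The block runs as the last step of its branch, so it only fires
    # when that branch has actually matched.
    $s =~ s{
          \x1Ba (?{ push @STACK, $CURRENT; $CURRENT = \%Ma_2_UTF8 })
        | \x1Bs (?{ $CURRENT = pop(@STACK) || \%M_2_UTF8 })
        | (.)
    }{
        defined $1
            ? ( exists $CURRENT->{$1} ? $CURRENT->{$1} : $1 )
            : ''                     # an ESC sequence: emit nothing
    }gexs;
    return $s;
}
```

Compared with approach 1, the switching logic moves into the pattern and the replacement side only does the lookup. Whether that is easier to maintain is debatable, since every new ESC sequence needs its own alternative in the regex.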
Well, Monks, you have helped me many times before.
Do you once more have any advice to give me?
Should I work along the lines you have seen in this posting, or can you suggest better methods? It is a big bonus if the resulting code is easy to understand and maintain, for instance when changing, removing or adding character mappings.
Yours most sincerely,
/L