Shooting at a Moving Target

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

Do you remember how old printers dealt with character sets?
They used ESCape sequences to switch between tables, which gave them access to more characters than would otherwise have been possible.

Now I am working with records that use the same technique.
For example, in the default character set the byte sequence \xE0\x65 may represent one character.
But after the ESC sequence \x1B\x49 we are supposed to look in another character set, and from there on \xE0\x65 may denote a different character.

I want to be able to convert these records to UTF8.

I have been thinking of two ways to deal with the changing character sets.
1. Use $line =~ s/$regexp|$ESC/$CURRENTCHARSET->{$&} or change_charset/ge; to catch any ESC sequences. When one is encountered, I simply replace the ESC sequence itself with an empty string and change the current charset to the appropriate one.

Here is some code that deals with an utterly simplified case. It seems to work, even if it is probably not very efficient. (In my case I don't think that is an issue.)

#!/usr/bin/perl -w
use strict;
use diagnostics;

# Mappings between character sets
my %M_2_UTF8   = ( 'A' => 'X', );          # default set
my %Ma_2_UTF8  = ( 'A' => 'a', );          # set a

my $CURRENTSET = \%M_2_UTF8; 
my @setqueue = ();                         # a stack where we push and pop


# ESC-sequences
my $switch_to_set_a = "\x1Ba";
my $reset           = "\x1Bs";             # return to previous charset

my %ESC = ( $switch_to_set_a => sub {
                            push(@setqueue, $CURRENTSET);
                            $CURRENTSET = \%Ma_2_UTF8;
                            return '';
                            },
           
            $reset          => sub {
                            $CURRENTSET = pop @setqueue;
                            return '';
                            },
           );



my $str     = 'X|a|X';
my $esc_str = qq(A|${switch_to_set_a}A|${reset}A);
my $convstr = fix($esc_str);

if ($str eq $convstr) {
    print "ok\n";
}
else {
    print "didn't work...\n";
}


sub fix {
    my $s = shift;
    
    $s =~ s{(\x1B[abcs])|(.)}       
           {repl($&)}ge;

    return $s;
}

sub repl {
    my $match = shift;
    if (exists $ESC{$match}) {
        return $ESC{$match}();
    }
    elsif (exists $CURRENTSET->{$match}) {
        return $CURRENTSET->{$match};
    }
    else {
        return $match;
    }
}
__END__

2. My second thought is to use the regexp feature (?{code}) to swap between charsets. But here I am entirely at a loss on how to implement the idea.

Well, Monks, you have helped me many times before.
Do you once more have any advice to give me?
Should I work along the lines you have seen in this posting, or can you suggest better methods? It is a big bonus if the resulting code is easy to understand and maintain, for example when changing, removing, or adding character mappings.

Yours most sincerely,
/L


Re: Shooting at a Moving Target
by Jenda (Abbot) on Feb 09, 2007 at 16:02 UTC

I think it would be best to split the record on the escape character, convert each part to UTF8 separately and then merge the results back. Something like

use Encode qw(decode);
$record = "a" . $record; # prepend the ID of the default charset ($record already holds the raw bytes)
my @parts = split /\x1B/, $record;
foreach my $part (@parts) {
 my $charset = substr( $part,0,1,'');
 $charset = $ESC{$charset}; # to get the name from the id
 $part = decode( $charset, $part);
}
$record = join '', @parts;
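
A more complete version of the same idea might look like the sketch below. The mapping from set IDs to encoding names is an assumption for illustration only (the real records will use other IDs and encodings), and decode_record is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical mapping from the byte that follows ESC to an Encode
# encoding name. Replace with the sets your records really use.
my %ENCODING_FOR = (
    a => 'iso-8859-1',    # assumed default set
    b => 'iso-8859-7',    # assumed alternate set (Greek)
);

sub decode_record {
    my ($record) = @_;

    # Prepend the ID of the default set so that every part starts with an ID.
    my @parts = split /\x1B/, 'a' . $record;

    for my $part (@parts) {
        my $id = substr($part, 0, 1, '');    # remove and keep the set ID
        my $encoding = $ENCODING_FOR{$id}
            or die "Unknown character set ID '$id'";
        $part = decode($encoding, $part);    # bytes -> Perl character string
    }
    return join '', @parts;
}
```

The result is a Perl character string; to write it out as UTF-8 bytes, pass it through Encode::encode('UTF-8', ...) or print it to a filehandle opened with an :encoding(UTF-8) layer.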

Jenda
Support Denmark!
Defend the free world!


Re^2: Shooting at a Moving Target

by Anonymous Monk on Feb 09, 2007 at 22:11 UTC


Re: Shooting at a Moving Target
by almut (Canon) on Feb 09, 2007 at 16:44 UTC

In case every single part in between the ESC sequences does conform to an encoding that can be handled with Encode::decode, the approach suggested by Jenda seems very reasonable.

Otherwise, if you need more arbitrary mapping facilities, I don't think there's anything fundamentally wrong with the approach you implemented :) - as long as you don't need arbitrary multibyte-to-multibyte mappings (OTOH, you do mention "...the byte sequence \xE0\x65 may represent one character" ...). In the latter case, the '.' in your substitution regex might become a little unwieldy. For singlebyte-to-multibyte mappings though (like "legacy" to UTF-8), it seems fine to me.
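
If multibyte keys are needed, one way around the '.' problem is to build the pattern from the keys of the active table, longest first, so that a two-byte sequence wins over its first byte. This is only a sketch: the tables, the target characters, and the rule "ESC plus one byte selects a set" are assumptions, not something taken from the real data.

```perl
use strict;
use warnings;

# Hypothetical tables: keys are byte sequences, values are Unicode characters.
my %TABLE_FOR = (
    default => { "\xE0\x65" => "\x{E8}" },     # assumed target: e with grave
    "\x49"  => { "\xE0\x65" => "\x{3B5}" },    # assumed target: Greek epsilon
);

# One pattern per set, longest keys first so multibyte sequences win.
my %REGEX_FOR;
for my $set (keys %TABLE_FOR) {
    my $alt = join '|',
              map  { quotemeta($_) }
              sort { length($b) <=> length($a) }
              keys %{ $TABLE_FOR{$set} };
    # An empty table gets a pattern that never matches.
    $REGEX_FOR{$set} = length $alt ? qr/$alt/ : qr/(?!)/;
}

sub convert {
    my ($bytes) = @_;
    my $set = 'default';
    my $out = '';

    # Walk the string: ESC + ID switches sets, a known key is mapped,
    # anything else is copied through unchanged.
    while ($bytes =~ /\G(?:\x1B(.)|($REGEX_FOR{$set})|(.))/gs) {
        if (defined $1) {
            die sprintf("Unknown set ID 0x%02X", ord($1))
                unless exists $TABLE_FOR{$1};
            $set = $1;
        }
        elsif (defined $2) { $out .= $TABLE_FOR{$set}{$2} }
        else               { $out .= $3 }
    }
    return $out;
}
```

Because the pattern is rebuilt from the table, adding or removing a mapping only means editing the hash.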

For more complex requirements, it might ultimately turn out to be easier to use a proper parser (e.g. Parse::RecDescent), but YMMV.


Re: Shooting at a Moving Target
by varian (Chaplain) on Feb 09, 2007 at 16:36 UTC

For the conversion between character sets you could opt to reuse the existing (core) module Encode and add a new encoding for each new character set that you would like to support.
See Encode::Encoding for instructions.

Using the Encode module is straightforward; some nice examples are given in the Unicode Tutorial on PerlMonks.
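
As a rough illustration of what such a custom encoding could look like, here is a minimal sketch based on the Encode::Encoding base class. The package name, the encoding name 'x-legacy-set-a', and the mapping table are all made up for the example, and only the decode direction is shown.

```perl
package Encode::LegacySetA;    # hypothetical package name
use strict;
use warnings;
use parent 'Encode::Encoding';

# Register the encoding under a made-up name.
__PACKAGE__->Define('x-legacy-set-a');

# Assumed single-byte mapping; the real table would be much larger.
my %CHAR_FOR = ( "A" => "a", "\xE0" => "\x{3B1}" );

sub decode {
    my ($self, $octets, $check) = @_;
    # Map every byte that has an entry; copy the rest unchanged.
    (my $string = $octets) =~ s/(.)/exists $CHAR_FOR{$1} ? $CHAR_FOR{$1} : $1/gse;
    $_[1] = '' if $check;      # report that all input was consumed
    return $string;
}

package main;
use Encode qw(decode);

my $text = decode('x-legacy-set-a', "A\xE0B");
```

Once each legacy set is registered this way, the records can be split on ESC and each part handed to decode() with the matching encoding name.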


Re: Shooting at a Moving Target
by talexb (Chancellor) on Feb 09, 2007 at 16:48 UTC

I don't have time to write some code to demonstrate the idea, but I wonder if you could use a HoH (hash of hashes) in order to deconstruct the escape sequences, and from there perform your search and replace operations.
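
One possible reading of that idea, as an untested-in-production sketch: the outer hash is keyed by set ID and the inner hashes hold the mappings. The set IDs, the 's' reset sequence, and the tiny tables are taken from the simplified case in the question; everything else is an assumption.

```perl
use strict;
use warnings;

# Hash of hashes: set ID => { input character => output character }.
my %TABLES = (
    default => { 'A' => 'X' },
    a       => { 'A' => 'a' },
);

sub fix {
    my ($s) = @_;
    my @stack;                   # previously active set IDs
    my $current = 'default';

    # Handle one ESC sequence: 's' returns to the previous set,
    # any other ID becomes the current set. Always yields ''.
    my $switch = sub {
        my ($id) = @_;
        if ($id eq 's') { $current = @stack ? pop @stack : 'default' }
        else            { push @stack, $current; $current = $id }
        return '';
    };

    $s =~ s{\x1B([as])|(.)}{
        defined $1 ? $switch->($1)
                   : ($TABLES{$current}{$2} // $2)
    }gse;

    return $s;
}
```

Adding a character set is then a matter of adding one more inner hash and listing its ID in the character class.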

Alex / talexb / Toronto

"Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds


Re: Shooting at a Moving Target
by Moron (Curate) on Feb 12, 2007 at 15:22 UTC

Encode::Unicode

-M

Free your mind


Re: Shooting at a Moving Target
by belg4mit (Prior) on Feb 14, 2007 at 02:33 UTC

$TABLE='a'; #or whatever the default set is
s/(?:\x1B([abcs])(?{$TABLE=$^N}))?(.)/${$TABLE}{$2}/e

||$2

Mmmm symbolic references.

UPDATE:

$TABLES{$TABLE}->{$2}

-- In Bob We Trust, All Others Bring Data.


Re^2: Shooting at a Moving Target

by Aristotle (Chancellor) on Feb 14, 2007 at 02:55 UTC

my $char_for = $codepoint_table{ 'a' };

s{
    \x1B ([abcs]) | (.)
}{
    $1 ? do { $char_for = $codepoint_table{ $1 }; '' } : $char_for->{ $2 }
}ge;

Makeshifts last the longest.


Re^3: Shooting at a Moving Target

by belg4mit (Prior) on Feb 14, 2007 at 03:04 UTC

(?{

-- In Bob We Trust, All Others Bring Data.
