Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

Do you remember how old printers dealt with character sets?
They used ESCape sequences to switch between tables, and by doing so they could access more characters than would otherwise have been possible.

Now I am working with records that use the same technique.
For example, in the default character set the byte sequence \xE0\x65 may represent one character.
But once we hit the ESC sequence \x1B\x49 we are supposed to look in another character set, and from there on \xE0\x65 may denote a different character.

I want to be able to convert these records to UTF8.

I have been thinking of two ways to deal with the changing character sets.
1. Use $line =~ s/$regexp|$ESC/$CURRENTCHARSET->{$&} or change_charset/ge; to catch every ESC sequence. When one is encountered, I simply replace the ESC sequence itself with an empty string and switch the CURRENT CHARSET to the appropriate one.

Here is some code that deals with an utterly simplified case. It seems to work, even if it is probably not very efficient. (But in my case I don't think that is an issue.)
    #!/usr/bin/perl -w
    use strict;
    use diagnostics;

    # Mappings between character sets
    my %M_2_UTF8  = ( 'A' => 'X', );   # default set
    my %Ma_2_UTF8 = ( 'A' => 'a', );   # set a

    my $CURRENTSET = \%M_2_UTF8;
    my @setqueue   = ();   # a stack where we push and pop

    # ESC-sequences
    my $switch_to_set_a = "\x1Ba";
    my $reset           = "\x1Bs";   # return to previous charset

    my %ESC = (
        $switch_to_set_a => sub {
            push(@setqueue, $CURRENTSET);
            $CURRENTSET = \%Ma_2_UTF8;
            return '';
        },
        $reset => sub {
            $CURRENTSET = pop @setqueue;
            return '';
        },
    );

    my $str     = 'X|a|X';
    my $esc_str = qq(A|${switch_to_set_a}A|${reset}A);
    my $convstr = fix($esc_str);

    if ($str eq $convstr) { print "ok\n"; }
    else { print "didn't work...\n"; }

    sub fix {
        my $s = shift;
        $s =~ s{(\x1B[abcs])|(.)}
               {repl($&)}ge;
        return $s;
    }

    sub repl {
        my $match = shift;
        if (exists $ESC{$match}) { return $ESC{$match}(); }
        elsif (exists $CURRENTSET->{$match}) { return $CURRENTSET->{$match}; }
        else { return $match; }
    }
    __END__

2. My second thought is to use the regexp feature (?{code}) to swap between charsets. But here I am entirely at a loss on how to implement the idea.

Well, Monks, you have helped me many times before.
Do you once more have any advice to give me?
Should I work along the lines you have seen in this posting, or can you suggest better methods? It is a big bonus if the resulting code is easy to understand and maintain, for example when changing, removing, or adding character mappings.

Yours most sincerely,
/L

Replies are listed 'Best First'.
Re: Shooting at a Moving Target
by Jenda (Abbot) on Feb 09, 2007 at 16:02 UTC

    I think it would be best to split the record on the escape character, convert each part to UTF8 separately and then merge the results back. Something like

        $record = "a" . $record;   # prepend the ID of the default charset
        my @parts = split /\x1B/, $record;
        foreach my $part (@parts) {
            my $charset = substr($part, 0, 1, '');   # strip the ID off the front
            $charset = $ESC{$charset};   # to get the name from the id
            $part = decode($charset, $part);
        }
        $record = join '', @parts;
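
The split-and-decode idea above can be fleshed out into a small runnable sketch. The %ENCODING_FOR table, the choice of ISO-8859-1 as set "a" and ISO-8859-7 as set "b", and the name record_to_utf8 are all assumptions made for illustration. The real records may use tables that no stock Encode encoding covers.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical mapping: the byte that follows ESC => an Encode encoding name.
my %ENCODING_FOR = (
    'a' => 'iso-8859-1',   # assumed default set
    'b' => 'iso-8859-7',   # assumed alternate (Greek) set
);

sub record_to_utf8 {
    my ($record) = @_;
    my $text = '';

    # Prepend the default ID so that every chunk starts with a charset ID.
    for my $chunk (split /\x1B/, "a$record") {
        my $id       = substr($chunk, 0, 1, '');   # remove the ID from the chunk
        my $encoding = $ENCODING_FOR{$id}
            or die "Unknown charset id '$id'\n";
        $text .= decode($encoding, $chunk);        # bytes -> Perl characters
    }
    return encode('UTF-8', $text);                 # characters -> UTF-8 bytes
}

# Usage: Latin-1 text, then ESC b, then two ISO-8859-7 bytes.
print record_to_utf8("caf\xE9 \x1Bb\xE1\xE2"), "\n";
```

Decoding each chunk to characters first and encoding once at the end keeps the charset switching separate from the output format.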

      This is a very clear approach.
      I like the idea.
      I will have to look further into this line of thought.

      Thank you for your suggestion Jenda,

      /L
Re: Shooting at a Moving Target
by almut (Canon) on Feb 09, 2007 at 16:44 UTC

    In case every single part in between the ESC sequences does conform to an encoding that can be handled with Encode::decode, the approach suggested by Jenda seems very reasonable.

    Otherwise, if you need more arbitrary mapping facilities, I don't think there's anything fundamentally wrong with the approach you implemented :) - as long as you don't need arbitrary multibyte-to-multibyte mappings (OTOH, you do mention "...the byte sequence \xE0\x65 may represent one character" ...). In the latter case, the '.' in your substitution regex might become a little unwieldy. For singlebyte-to-multibyte mappings though (like "legacy" to UTF-8), it seems fine to me.

    For more complex requirements, it might ultimately turn out to be easier to use a proper parser (e.g. Parse::RecDescent), but YMMV.

Re: Shooting at a Moving Target
by varian (Chaplain) on Feb 09, 2007 at 16:36 UTC
    Using a regexp to split the datastream between character sets makes perfect sense.

    For the conversion between character sets you could opt to reuse the existing (core) module Encode and add a new encoding for each new character set that you would like to support.
    See Encode::Encoding for instructions.

    Using the Encode module is straightforward; some nice examples are given in the Unicode Tutorial on PerlMonks.

Re: Shooting at a Moving Target
by talexb (Chancellor) on Feb 09, 2007 at 16:48 UTC

    I don't have time to write some code to demonstrate the idea, but I wonder if you could use a HoH (hash of hashes) in order to deconstruct the escape sequences, and from there perform your search and replace operations.
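
One way to read the hash-of-hashes suggestion is a table keyed first by character set and then by source byte sequence. The sketch below is only an illustration: the table contents, the set names, the %SET_FOR_ESC mapping and the function name convert are invented, and unmapped input is passed through unchanged (so high bytes are treated as Latin-1). Keys are tried longest first so that two-byte sequences such as \xE0\x65 win over single bytes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical tables: set name => { source bytes => Unicode character }.
my %TABLE = (
    'default' => { 'A' => 'A', "\xE0\x65" => "\x{00E8}" },
    'I'       => { 'A' => 'A', "\xE0\x65" => "\x{0451}" },
);

# Hypothetical mapping: byte after ESC => set name.
my %SET_FOR_ESC = ( 'I' => 'I', 's' => 'default' );

# One alternation of every known source sequence, longest first.
my %seen;
my $alternation = join '|',
    map  { quotemeta }
    sort { length($b) <=> length($a) }
    grep { !$seen{$_}++ }
    map  { keys %$_ } values %TABLE;
my $ANY_KEY = qr/$alternation/;

sub convert {
    my ($bytes) = @_;
    my $set = 'default';
    (my $chars = $bytes) =~ s{ \x1B (.) | ($ANY_KEY) | (.) }{
          defined $1 ? do { $set = $SET_FOR_ESC{$1} || die "Unknown ESC id '$1'\n"; '' }
        : defined $2 ? ( exists $TABLE{$set}{$2} ? $TABLE{$set}{$2} : $2 )
        :              $3
    }gexs;
    return encode('UTF-8', $chars);
}

# Usage: the same two bytes mean different things before and after ESC I.
print convert("A\xE0\x65\x1B\x49\xE0\x65"), "\n";
```

Adding, removing or changing a mapping then only touches %TABLE; the substitution itself stays the same.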

    Alex / talexb / Toronto

    "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

Re: Shooting at a Moving Target
by Moron (Curate) on Feb 12, 2007 at 15:22 UTC
    The documentation available for Encode::Unicode looks relevant to your problem, irrespective of whether you end up actually using that module.

    -M

    Free your mind

Re: Shooting at a Moving Target
by belg4mit (Prior) on Feb 14, 2007 at 02:33 UTC
    (Pseudo-code) evilness using experimental regexp features might look like:
        $TABLE='a'; #or whatever the default set is
        s/(?:\x1B([abcs])(?{$TABLE=$^N}))?(.)/${$TABLE}{$2}/e
    To simplify your table you could expand the RHS to include an ||$2.

    Mmmm symbolic references.

    UPDATE: You could of course "legitimize" that and use $TABLES{$TABLE}->{$2}
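
For anyone who wants to try the (?{ }) variant, here is a minimal runnable sketch of the "legitimized" form. It is not the pseudo-code above verbatim: the ESC handling is moved into its own alternative, the %TABLES contents and the set letters are invented, and a package variable is used because that is the simplest thing to assign from inside a code block. perlre documents (?{ }) as experimental, so treat this as a curiosity.

```perl
use strict;
use warnings;

# Hypothetical tables keyed by the letter that follows ESC.
my %TABLES = (
    'a' => { 'A' => 'X' },   # assumed default set
    'b' => { 'A' => 'a' },
);

our $TABLE;   # package variable, assigned from inside the (?{ }) block

sub fix {
    my ($s) = @_;
    local $TABLE = 'a';   # start every record in the default set

    # First alternative: ESC + set letter; the code block records the new set.
    # Second alternative: any other single character, looked up on the RHS.
    $s =~ s{ \x1B ([ab]) (?{ $TABLE = $^N }) | (.) }
           { defined $2 ? ( exists $TABLES{$TABLE}{$2} ? $TABLES{$TABLE}{$2} : $2 ) : '' }gexs;
    return $s;
}

# Usage: 'A' is translated differently depending on the active set.
print fix("A|\x1BbA|\x1BaA"), "\n";
```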

    --
    In Bob We Trust, All Others Bring Data.

      my $char_for = $codepoint_table{ 'a' };
      s{ \x1B ([abcs]) | (.) }{
          $1 ? do { $char_for = $codepoint_table{ $1 }; '' }
             : $char_for->{ $2 }
      }gex;   # /x so that the spaces in the pattern are not matched literally

      Makeshifts last the longest.

        But where's the evil?! (You also didn't use the "requested" (?{ construct.) TIMTOWTDI.

        --
        In Bob We Trust, All Others Bring Data.