Eight bit character (non-ASCII) conversion

snax has asked for the wisdom of the Perl Monks concerning the following question:

Well, thanks to the wisdom imparted by the Monks here in my first thread, I thought I had licked the problem of quick conversion from Mac 8-bit European characters to a Windows code page, by employing tr/// rather than s///g. This is correct (i.e. much faster) but pitfalls remain. Maybe my experience will help someone else, so here goes.

The basic strategy I arrived at was to use the character maps provided at unicode.org, in particular the Mac mapping and the Windows code page. These are well formatted and can be "linked" to one another through a text field that has the Unicode standard name for each character, alongside the hex code for each character under each mapping. All good so far.

Here's where I fell in a hole.

I was thinking in terms of characters, so I was setting up my replacements like this:

foreach $unicode (keys %from) {
    if (defined($to{$unicode})) {
        if ($from{$unicode} ne $to{$unicode}) {

            my $fromchar = $from{$unicode};
            $fromchar =~ s/0x(..)/pack('c',hex($1))/ge;

            $tochar = $to{$unicode};
            $tochar =~ s/0x(..)/pack('c',hex($1))/ge;

            $wrong = $wrong . $fromchar;
            $right = $right . $tochar;
        }
    }
}

The keys to each hash are the Unicode standard text descriptions for the characters, and the values are the hex codes. So I go through, find the hex codes that differ, and pack the actual character into a string for later replacement.
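As a rough illustration of how the two hashes might be filled, here is a minimal sketch. It assumes each mapping line has the layout "byte code, Unicode code point, # name" separated by whitespace; the sub name `parse_mapping` and the sample lines are made up for this example and are not taken from the real files.

```perl
use strict;
use warnings;

# Turn mapping-file text into a hash of { Unicode name => byte code }.
# Assumed line layout (an assumption, check it against the real files):
#   0x8A<TAB>0x00E4<TAB># LATIN SMALL LETTER A WITH DIAERESIS
sub parse_mapping {
    my ($text) = @_;
    my %map;
    for my $line (split /\n/, $text) {
        # byte code, Unicode code point (ignored here), then "# NAME";
        # comment lines and entries without a name simply fail to match
        my ($byte, $name) =
            $line =~ /^(0x[0-9A-Fa-f]{2})\s+0x[0-9A-Fa-f]+\s+#\s*(.*\S)/
            or next;
        $map{$name} = $byte;
    }
    return %map;
}

# Tiny hand-written samples standing in for the downloaded files.
my %from = parse_mapping(join "\n",
    "# sample in the assumed Mac layout",
    "0x41\t0x0041\t# LATIN CAPITAL LETTER A",
    "0x8A\t0x00E4\t# LATIN SMALL LETTER A WITH DIAERESIS",
);
my %to = parse_mapping(join "\n",
    "0x41\t0x0041\t#LATIN CAPITAL LETTER A",
    "0xE4\t0x00E4\t#LATIN SMALL LETTER A WITH DIAERESIS",
);

# Names present in both maps whose byte codes differ.
my @differing = grep { defined $to{$_} && $from{$_} ne $to{$_} } sort keys %from;
print "$_: $from{$_} -> $to{$_}\n" for @differing;
```

Keying on the name only works when both files spell the names identically, so a real run would probably want to report names that appear in one file but not the other.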

Where's the problem? Well, as long as I print them out singly, to check that the replacements are what I expect, or use s///g (throwing the replacements into a hash as in the thread referenced earlier), there isn't a problem. However, using eval("tr/$wrong/$right/") things go haywire, and some of the transliterations are just plain wrong -- and I can't figure out why. This pains me because this is clearly a transliteration job (much, much faster) rather than a search/replace job.

So I ponder. I mean, TMTOWTDI, right?

I also peruse the docs. perlman:perlop proves to be very enlightening, in fact. In the worthy tome it says

    tr [\200-\377]
       [\000-\177];             # delete 8th bit

(at the very end). I see this and know that this is the solution -- somehow using the actual 8 bit characters is making the tr/// operator unhappy, and this is the way around it. I have at my disposal hex codes, which I simply need to switch to octal. So instead of the method above where I pack, I use:

foreach $unicode (keys %from) {
    if (defined($to{$unicode})) {
        if ($from{$unicode} ne $to{$unicode}) {

            my $fromhex = $from{$unicode};
            my $tohex = $to{$unicode};

            $wrong = $wrong . '\\' . 
                        sprintf('%o', hex $fromhex);
            $right = $right . '\\' . 
                        sprintf('%o', hex $tohex);

        }

    }
}

See sprintf and hex for more on these features -- or turn on your function nodelet!
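For what it is worth, here is one way the octal strings could be turned into a reusable converter. This is only a sketch: the three byte pairs are hypothetical samples rather than the full table, and `$convert` is a name invented for the example. The eval is checked so that a malformed tr/// is reported instead of silently doing nothing.

```perl
use strict;
use warnings;

# Hypothetical sample pairs: Mac byte => Windows code page byte.
my %pairs = (0x8A => 0xE4, 0x8E => 0xE9, 0x9F => 0xFC);

my ($wrong, $right) = ('', '');
for my $from (sort { $a <=> $b } keys %pairs) {
    # %03o pads to three octal digits so a following digit can never
    # be read as part of the escape
    $wrong .= sprintf '\\%03o', $from;
    $right .= sprintf '\\%03o', $pairs{$from};
}

# tr/// does not interpolate variables, so compile it once through a
# string eval and keep the resulting sub; report $@ if that fails.
my $convert = eval "sub { (my \$s = shift) =~ tr/$wrong/$right/; \$s }"
    or die "could not compile tr///: $@";

my $converted = $convert->("caf\x8E");
print unpack('H*', $converted), "\n";
```

Compiling once and calling the sub many times keeps the cost of the string eval out of the per-line loop.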

Now everything is happy and I can continue on my merry way.

Anyone know why tr would be unhappy with explicit 8 bit chars? I'm using ActivePerl on Win2k.

Replies are listed 'Best First'.
(tye)RE2: Eight bit character (non-ASCII) conversion by tye (Sage) on Nov 15, 2000 at 22:33 UTC
Of course, if either $wrong or $right contains "/", then you will run into problems because you'll end up with `eval "tr/$wrong1/$wrong2/$right1/$right2/"` where $wrong1 is the part of $wrong before the "/" and $wrong2 is the part of $wrong after the "/", etc. You'd probably get a syntax error, which is why you should always check for failure of eval and report $@.

Just to verify that this is the only problem, I wrote this:

#!/usr/bin/perl -w
use strict;
my $all= pack "C*", 0..255;
print '$all has ', length($all), " bytes.\n";
my $in= $all;
$in =~ tr-/--d;
print '$in has ', length($in), " bytes.\n";
my $out= $in;
$out= chop($out) . $out;
print '$out has ', length($out), " bytes.\n";
my $count= eval "\$all =~ tr/$in/$out/";
warn "$@" if $@;
print "Translated $count bytes.\n";
print join( ",", unpack "C*", $all ), "\n";
__END__
$all has 256 bytes.
$in has 255 bytes.
$out has 255 bytes.
Translated 254 bytes.
255,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,
24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,
47,46,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,
68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,
90,92,91,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,
109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,
125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,
141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,
157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,
173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,
189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,
205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,
221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,
237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,
253,254

Note that this works just fine on every non-delimiter character, including nul ("\0") and including 8-bit characters.

Notice the output sequence includes "44,45,47,46,48" where 47 (ASCII for "/") was not changed.

- tye (but my friends call me "Tye")
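One way to sidestep the delimiter issue altogether is to write every byte as an octal escape before building the tr/// string, so that no raw "/", "-" or "\" ever reaches the parser. The sketch below assumes that approach; `tr_escape` is a hypothetical helper name, not something from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: spell each byte as a three-digit octal escape,
# e.g. "/" becomes \057, so it cannot end the tr/// list early.
sub tr_escape {
    my ($bytes) = @_;
    return join '', map { sprintf('\\%03o', ord $_) } split //, $bytes;
}

my $in   = "/-\\";      # slash, hyphen, backslash: all special inside tr///
my $out  = "abc";
my $text = "a/b-c\\d";

my $code  = "\$text =~ tr/" . tr_escape($in) . "/" . tr_escape($out) . "/";
my $count = eval $code;
die "tr failed: $@" if $@;

print "Translated $count bytes: $text\n";
```

The same escaping applies to both lists, so the positions in the search list and the replacement list stay lined up one to one.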
RE: (tye)RE2: Eight bit character (non-ASCII) conversion by snax (Hermit) on Nov 15, 2000 at 23:27 UTC
I am properly chastened -- I do have a tendency to forget checking for failures, which is really very bad. This is a slick way to check problems with `tr`, and still leaves me scratching my head: I checked my `$wrong` and `$right` strings and they contained no `/` (surprise) but using the `pack`ed characters produces a different result than using the octal "names", and only when using `tr///` -- not when printing them out to check or using them in a `s///g` construction. I must do further investigation.
RE: Eight bit character (non-ASCII) conversion by lhoward (Vicar) on Nov 15, 2000 at 17:27 UTC
If I were you I'd just use one of the character set conversion modules that already exist for Perl, instead of reinventing the wheel with a tr statement: Locale::Iconv or Text::Iconv. I have used these modules before with great success.
RE: RE: Eight bit character (non-ASCII) conversion by snax (Hermit) on Nov 15, 2000 at 17:47 UTC
I think I have a slightly different problem than what these modules address -- moreover, these modules require a local `iconv` implementation, which Win2k does not have (to my knowledge). Even if I have an implementation, I need to have the conversion table installed, which is what I was grabbing and parsing from unicode.org -- given that I had already parsed the file I might just as well build the `tr///` strings at the same time. Finally, I couldn't find any documentation indicating which conversion tables (on a Solaris installation) corresponded to what -- nothing that looked like a Mac table or a Windows code page 1252 table, which I think is standard 8859-1 but I'm not positive. The conversion tables are in some binary format, too, so I can't just inspect them. Do you have more info you could share?
RE: RE: RE: Eight bit character (non-ASCII) conversion by lhoward (Vicar) on Nov 15, 2000 at 17:54 UTC
I know that you can download a free version of libiconv. I don't know if it will build on Windows or not, but it does support a couple of character sets that sound like what you're looking for. With most versions of the iconv library there is an iconv command line program that will do conversions. On mine, "iconv --list" will give me a list of all the character sets it supports. You may also want to check out some of the Unicode conversion modules: Unicode::Map, Unicode::Map8, and Unicode::MapUTF8. I believe that all of them run without needing any external (non-CPAN supplied) libraries. Since those modules are designed to map to/from Unicode, you could solve your problem by doing a 2-step conversion: MAC->UNICODE->WINDOWS1252
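The two-step idea can be sketched with the Encode module, which ships with Perl releases newer than the one this thread was written against; it is swapped in here for illustration and is not one of the modules named above. The encoding names 'MacRoman' and 'cp1252' are the ones Encode uses, and the sample bytes are made up.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Sample input: "caf" followed by the Mac byte for e-acute (0x8E).
my $mac_bytes = "caf\x8E";

# Step 1: Mac bytes -> Perl's internal Unicode characters.
my $chars = decode('MacRoman', $mac_bytes);

# Step 2: Unicode characters -> Windows code page 1252 bytes.
my $win_bytes = encode('cp1252', $chars);

print unpack('H*', $mac_bytes), " -> ", unpack('H*', $win_bytes), "\n";
```

Going through Unicode means only one table per character set is needed, at the cost of two lookups per character instead of one tr/// pass.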
RE: RE: RE: RE: Eight bit character (non-ASCII) conversion by snax (Hermit) on Nov 15, 2000 at 20:34 UTC