Well, thanks to the wisdom imparted by the Monks here in my first thread, I thought I had licked the problem of quickly converting Mac 8-bit European characters to the Windows code page by using tr/// rather than s///g. That is the right approach (i.e. much faster), but pitfalls remain. Maybe my experience will help someone else, so here goes.

The basic strategy I arrived at was to use the character maps provided at unicode.org, in particular the Mac mapping and the Windows code page. These are well formatted: each line gives the hex code of a character under that mapping along with a text field holding the Unicode standard name, so the two maps can be "linked" to one another through that name. All good so far.
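For anyone who wants to try the linking step, here is a rough sketch of how the two hashes might be built. It is not my actual script: the sub name parse_table is made up for illustration, the rows are a small hand-typed sample, and I am assuming the three-column layout I remember from those files (byte code, Unicode code point, then "#" and the character name).

```perl
use strict;
use warnings;

# Hand-typed sample rows (an assumption, not copied from the real files).
my $mac_table = <<'END';
0x41  0x0041  # LATIN CAPITAL LETTER A
0x80  0x00C4  # LATIN CAPITAL LETTER A WITH DIAERESIS
0x8A  0x00E4  # LATIN SMALL LETTER A WITH DIAERESIS
END

my $win_table = <<'END';
0x41  0x0041  #LATIN CAPITAL LETTER A
0xC4  0x00C4  #LATIN CAPITAL LETTER A WITH DIAERESIS
0xE4  0x00E4  #LATIN SMALL LETTER A WITH DIAERESIS
END

# parse_table (hypothetical helper): returns a hash of name => hex byte code.
# Lines that do not have all three columns are skipped.
sub parse_table {
    my ($text) = @_;
    my %code_for;
    for my $line (split /\n/, $text) {
        next unless $line =~ /^(0x[0-9A-Fa-f]{2})\s+0x[0-9A-Fa-f]{4}\s+#\s*(.+?)\s*$/;
        $code_for{$2} = $1;
    }
    return %code_for;
}

my %from = parse_table($mac_table);
my %to   = parse_table($win_table);

# List the characters whose byte codes differ between the two mappings.
for my $name (sort keys %from) {
    next unless defined $to{$name} && $from{$name} ne $to{$name};
    print "$name: $from{$name} -> $to{$name}\n";
}
```

Keying on the name follows what I did; the Unicode code point column would presumably work just as well as the link between the two tables.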

Here's where I fell in a hole.

I was thinking in terms of characters, so I was setting up my replacements like this:

foreach my $unicode (keys %from) {
    if (defined($to{$unicode})) {
        if ($from{$unicode} ne $to{$unicode}) {

            # turn the hex code (e.g. 0x8A) into the raw 8-bit character;
            # 'C' is the unsigned char template, needed for codes above 0x7F
            my $fromchar = $from{$unicode};
            $fromchar =~ s/0x(..)/pack('C',hex($1))/ge;

            my $tochar = $to{$unicode};
            $tochar =~ s/0x(..)/pack('C',hex($1))/ge;

            $wrong = $wrong . $fromchar;
            $right = $right . $tochar;
        }
    }
}

The keys to each hash are the Unicode standard text descriptions for the characters, and the values are the hex codes. So I go through, find the hex codes that differ, and pack the actual character into a string for later replacement.

Where's the problem? Well, as long as I print the characters out singly to check that the replacements are what I expect, or use s///g (throwing the replacements into a hash as in the thread referenced earlier), there isn't a problem. However, when I use eval("tr/$wrong/$right/"), things go haywire: some of the transliterations are just plain wrong, and I can't figure out why. This pains me because this is clearly a transliteration job (much, much faster) rather than a search/replace job.

So I ponder. I mean, TMTOWTDI, right?

I also peruse the docs. perlman:perlop proves to be very enlightening, in fact. In the worthy tome it says

    tr [\200-\377]
       [\000-\177];             # delete 8th bit

(at the very end). I see this and know that this is the solution -- somehow using the actual 8-bit characters is making the tr/// operator unhappy, and this is the way around it. I have hex codes at my disposal, which I simply need to switch to octal. So instead of packing as in the method above, I use:

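foreach my $unicode (keys %from) {
    if (defined($to{$unicode})) {
        if ($from{$unicode} ne $to{$unicode}) {

            my $fromhex = $from{$unicode};
            my $tohex = $to{$unicode};

            # append a backslash plus the code in octal, e.g. \212 for 0x8A;
            # %03o pads to three digits so adjacent escapes cannot run together
            $wrong = $wrong . '\\' . 
                        sprintf('%03o', hex $fromhex);
            $right = $right . '\\' . 
                        sprintf('%03o', hex $tohex);

        }

    }
}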

See sprintf and hex for more on these features -- or turn on your function nodelet!
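To round this out, here is a minimal sketch of how the escaped strings might then be handed to tr///. The two sample pairs and the names @pairs and $mac_to_win are made up for illustration. Because tr/// builds its table at compile time, the sketch evals a small sub once and reuses it instead of calling eval on every string.

```perl
use strict;
use warnings;

# Hypothetical (Mac byte, Windows byte) pairs as hex strings, for illustration.
my @pairs = ( [ '0x80', '0xC4' ], [ '0x8A', '0xE4' ] );

# Build the two tr/// lists as three-digit octal escapes, e.g. \200\212.
my ( $wrong, $right ) = ( '', '' );
for my $pair (@pairs) {
    $wrong .= sprintf '\\%03o', hex $pair->[0];
    $right .= sprintf '\\%03o', hex $pair->[1];
}

# Compile the transliteration once; the sub works on a copy of its argument.
my $mac_to_win = eval "sub { (my \$s = shift) =~ tr/$wrong/$right/; return \$s }"
    or die "could not compile tr///: $@";

my $converted = $mac_to_win->("\x8Apfel \x80");

# Show the resulting bytes in hex.
print join( ' ', map { sprintf '%02X', ord } split //, $converted ), "\n";
# prints "E4 70 66 65 6C 20 C4"
```

The same sub can be called on every line of input, so the cost of the eval is paid only once.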

Now everything is happy and I can continue on my merry way.

Anyone know why tr would be unhappy with explicit 8 bit chars? I'm using ActivePerl on Win2k.

In reply to Eight bit character (non-ASCII) conversion by snax
