Well, thanks to the wisdom imparted by the Monks here in my first thread, I thought I had licked the problem of quick conversion from Mac 8-bit European characters to the Windows code page by employing tr/// rather than s///g. That much is correct (tr/// is much faster), but pitfalls remain. Maybe my experience will help someone else, so here goes.

The basic strategy I settled on was to use the character maps provided at unicode.org, in particular the Mac mapping and the Windows code page. Both are well formatted: each line gives the hex code of a character under that mapping, along with a text field holding the Unicode standard name for the character, so the two maps can be "linked" to one another through that name. All good so far.
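A minimal sketch of that linking step, under stated assumptions: the line layout ("0x80", a tab, the Unicode code point, a tab, then "#" and the name) is my reading of the unicode.org mapping files, the two-line excerpts are sample data rather than the real files, and parse_map is a made-up helper name. Names may not match exactly between the two files for every character, so treat the name-based join as something to verify.

```perl
use strict;
use warnings;

# Parse mapping lines of the assumed form "0x80<TAB>0x00C4<TAB># NAME"
# into a hash of Unicode name => byte code as a two-digit hex string.
sub parse_map {
    my ($text) = @_;
    my %codes;
    for my $line (split /\n/, $text) {
        next if $line =~ /^\s*(?:#|$)/;    # skip comment and blank lines
        # byte code, Unicode code point, then "#" and the character name
        next unless $line =~ /^0x([0-9A-Fa-f]{2})\s+0x[0-9A-Fa-f]{4}\s+#\s*(.+?)\s*$/;
        $codes{$2} = uc $1;
    }
    return %codes;
}

# Short excerpts standing in for the real mapping files (sample data only).
my %from = parse_map("0x80\t0x00C4\t# LATIN CAPITAL LETTER A WITH DIAERESIS\n"
                   . "0x41\t0x0041\t# LATIN CAPITAL LETTER A\n");
my %to   = parse_map("0xC4\t0x00C4\t#LATIN CAPITAL LETTER A WITH DIAERESIS\n"
                   . "0x41\t0x0041\t#LATIN CAPITAL LETTER A\n");

# Join the two maps on the Unicode name.
for my $name (sort keys %from) {
    next unless defined $to{$name};
    printf "%-40s Mac 0x%s -> Windows 0x%s\n", $name, $from{$name}, $to{$name};
}
```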

Here's where I fell in a hole.

I was thinking in terms of characters, so I was setting up my replacements like this:

The keys to each hash are the Unicode standard text descriptions for the characters, and the values are the hex codes. So I go through, find the hex codes that differ, and pack the actual character into a string for later replacement.
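A minimal sketch of that pack-based setup, not the original listing: it assumes %from and %to hold Unicode name => hex byte code as described, and the three entries are sample data chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data: Unicode name => hex byte code in each encoding.
my %from = ('LATIN CAPITAL LETTER A WITH DIAERESIS' => '80',
            'LATIN SMALL LETTER E WITH ACUTE'       => '8E',
            'LATIN CAPITAL LETTER A'                => '41');
my %to   = ('LATIN CAPITAL LETTER A WITH DIAERESIS' => 'C4',
            'LATIN SMALL LETTER E WITH ACUTE'       => 'E9',
            'LATIN CAPITAL LETTER A'                => '41');

my ($wrong, $right) = ('', '');
for my $unicode (sort keys %from) {
    next unless defined $to{$unicode};
    next if $from{$unicode} eq $to{$unicode};    # same byte in both maps
    $wrong .= pack 'C', hex $from{$unicode};     # the literal Mac byte
    $right .= pack 'C', hex $to{$unicode};       # the literal Windows byte
}

# Show the collected bytes as dot-separated hex values.
printf "%vX -> %vX\n", $wrong, $right;
```

The two strings end up holding raw 8-bit characters in matching positions, which is what the single-character checks and the s///g approach handle without trouble.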

Where's the problem? Well, as long as I print them out singly to check that the replacements are what I expect, or use s///g (throwing the replacements into a hash as in the thread referenced earlier), there isn't a problem. However, using eval("tr/$wrong/$right/"), things go haywire and some of the transliterations are just plain wrong -- and I can't figure out why. This pains me because this is clearly a transliteration job (much, much faster) rather than a search/replace job.

So I ponder. I mean, TMTOWTDI, right?

I also peruse the docs. perlman:perlop proves to be very enlightening, in fact. In the worthy tome it says

tr [\200-\377] [\000-\177]; # delete 8th bit
(at the very end). I see this and know that this is the solution: somehow using the actual 8-bit characters is making the tr/// operator unhappy, and octal escapes are the way around it. I have hex codes at my disposal, which I simply need to convert to octal. So instead of packing as in the method above, I use:
foreach $unicode (keys %from) {
    if (defined($to{$unicode})) {
        if ($from{$unicode} ne $to{$unicode}) {
            my $fromhex = $from{$unicode};
            my $tohex = $to{$unicode};
            $wrong = $wrong . '\\' . sprintf('%o', hex $fromhex);
            $right = $right . '\\' . sprintf('%o', hex $tohex);
        }
    }
}
See sprintf and hex for more on these features -- or turn on your function nodelet!
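For completeness, a sketch of how the escaped lists might then be applied. It is a variation rather than my exact code: $wrong and $right are hard-coded here for two sample characters instead of being built by the loop, and the tr/// is wrapped in an anonymous sub (the name $convert is made up) so the string eval runs only once.

```perl
use strict;
use warnings;

# Escaped lists as the loop would build them for two sample characters
# (Mac 0x80 -> Windows 0xC4, Mac 0x8E -> Windows 0xE9).
my $wrong = '\\200\\216';
my $right = '\\304\\351';

# tr/// builds its table at compile time, so the lists have to be
# spliced into source text and compiled with a string eval.
my $convert = eval "sub { (my \$s = shift) =~ tr/$wrong/$right/; \$s }"
    or die "could not compile transliteration: $@";

my $mac = "caf\x8E \x80rger";      # sample Mac-encoded bytes
my $win = $convert->($mac);        # same text with Windows byte codes

# Show the converted bytes as dot-separated hex values.
printf "%vX\n", $win;
```

Compiling once and calling the resulting sub for each line keeps the speed advantage of tr/// without paying for an eval on every line.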

Now everything is happy and I can continue on my merry way.

Anyone know why tr would be unhappy with explicit 8-bit chars? I'm using ActivePerl on Win2k.


In reply to Eight bit character (non-ASCII) conversion by snax
