The basic strategy I arrived at was to use the character maps provided at unicode.org, in particular the Mac mapping and the Windows code page. These are well formatted and can be "linked" to one another: each file has a text field with the Unicode standard name for each character, alongside the hex code for that character under that mapping. All good so far.
Here's where I fell in a hole.
I was thinking in terms of characters, so I was setting up my replacements like this:
The keys to each hash are the Unicode standard text descriptions for the characters, and the values are the hex codes. So I go through, find the hex codes that differ, and pack the actual character into a string for later replacement.
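For illustration, a minimal sketch of that approach. It assumes mapping lines of the form "native hex code, Unicode hex code, # Unicode name"; the helper `parse_map` and the three sample lines per table are made up for this example and are not taken from the post.

```perl
use strict;
use warnings;

# Hypothetical helper: turn mapping lines into a hash of
# Unicode name => native hex code. Assumed line format:
#   0x80<TAB>0x00C4<TAB># LATIN CAPITAL LETTER A WITH DIAERESIS
sub parse_map {
    my @lines = @_;
    my %code_for;
    for my $line (@lines) {
        next if $line =~ /^\s*(?:#|$)/;    # skip comment and blank lines
        if ($line =~ /^(0x[0-9A-Fa-f]{2})\s+0x[0-9A-Fa-f]+\s+#\s*(.+?)\s*$/) {
            $code_for{$2} = $1;
        }
    }
    return %code_for;
}

# Sample entries only; the real tables are much longer.
my @mac = (
    "0x41\t0x0041\t# LATIN CAPITAL LETTER A",
    "0x80\t0x00C4\t# LATIN CAPITAL LETTER A WITH DIAERESIS",
    "0x8E\t0x00E9\t# LATIN SMALL LETTER E WITH ACUTE",
);
my @win = (
    "0x41\t0x0041\t#LATIN CAPITAL LETTER A",
    "0xC4\t0x00C4\t#LATIN CAPITAL LETTER A WITH DIAERESIS",
    "0xE9\t0x00E9\t#LATIN SMALL LETTER E WITH ACUTE",
);

my %from = parse_map(@mac);
my %to   = parse_map(@win);

# Collect the raw 8-bit characters whose codes differ between the tables.
my ($wrong, $right) = ('', '');
for my $name (sort keys %from) {
    next unless defined $to{$name};
    next if hex($from{$name}) == hex($to{$name});
    $wrong .= pack('C', hex $from{$name});
    $right .= pack('C', hex $to{$name});
}
```

With these sample lines, `$wrong` ends up holding the two bytes 0x80 and 0x8E and `$right` the bytes 0xC4 and 0xE9, in matching order.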
Where's the problem? Well, as long as I print them out singly to check that the replacements are what I expect, or use s///g (throwing the replacements into a hash as in the thread referenced earlier), there isn't a problem. However, using eval("tr/$wrong/$right/") things go haywire, and some of the transliterations are just plain wrong -- and I can't figure out why. This pains me because this is clearly a transliteration job (much, much faster) rather than a search/replace job.
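For comparison, the hash-based s///g route might look roughly like the sketch below. The `%replace` map and the sample string are assumptions for illustration (two Mac Roman bytes and their Windows-1252 counterparts), not the code from the referenced thread.

```perl
use strict;
use warnings;

# Hypothetical byte map: Mac Roman code => Windows-1252 code.
my %replace = (
    "\x80" => "\xC4",    # capital A with diaeresis
    "\x8E" => "\xE9",    # small e with acute
);

# Build one alternation out of all the "wrong" bytes.
my $pattern = join '|', map { quotemeta($_) } sort keys %replace;

my $text = "caf\x8E \x80";
(my $fixed = $text) =~ s/($pattern)/$replace{$1}/g;
```

This does a hash lookup for every match, which is why a single tr/// over the whole string is the more natural fit for a one-character-to-one-character job.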
So I ponder. I mean, TMTOWTDI, right?
I also peruse the docs. perlman:perlop proves to be very enlightening, in fact. In the worthy tome it says

    tr [\200-\377] [\000-\177]; # delete 8th bit

(at the very end). I see this and know that this is the solution -- somehow using the actual 8 bit characters is making the tr/// operator unhappy, and this is the way around it. I have at my disposal hex codes, which I simply need to switch to octal. So instead of the method above where I pack, I use:

    foreach $unicode (keys %from) {
        if (defined($to{$unicode})) {
            if ($from{$unicode} ne $to{$unicode}) {
                my $fromhex = $from{$unicode};
                my $tohex = $to{$unicode};
                $wrong = $wrong . '\\' . sprintf('%o', hex $fromhex);
                $right = $right . '\\' . sprintf('%o', hex $tohex);
            }
        }
    }

See sprintf and hex for more on these features -- or turn on your function nodelet!
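A small end-to-end sketch of how the octal strings can then be fed to tr/// through a string eval. The two-entry `%from` and `%to` hashes and the sample text are assumptions for illustration; the real hashes would come from the full mapping tables.

```perl
use strict;
use warnings;

# Hypothetical two-entry tables: Unicode name => hex code.
my %from = (    # Mac side
    'LATIN CAPITAL LETTER A WITH DIAERESIS' => '0x80',
    'LATIN SMALL LETTER E WITH ACUTE'       => '0x8E',
);
my %to = (      # Windows side
    'LATIN CAPITAL LETTER A WITH DIAERESIS' => '0xC4',
    'LATIN SMALL LETTER E WITH ACUTE'       => '0xE9',
);

# Build the two lists as octal escape text, e.g. \200\216,
# rather than as raw 8-bit characters.
my ($wrong, $right) = ('', '');
for my $name (sort keys %from) {
    next unless defined $to{$name};
    next if $from{$name} eq $to{$name};
    $wrong .= sprintf '\\%03o', hex $from{$name};
    $right .= sprintf '\\%03o', hex $to{$name};
}

# tr/// does not interpolate variables, so the operator is
# compiled at run time with the escapes spelled out in its source.
my $text = "caf\x8E \x80";
eval "\$text =~ tr/$wrong/$right/; 1" or die $@;
```

After the eval, the Mac bytes in `$text` have been replaced by their Windows equivalents. The string handed to eval contains only ASCII characters, since each 8-bit code is written as a backslash and three octal digits.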
Now everything is happy and I can continue on my merry way.
Anyone know why tr would be unhappy with explicit 8 bit chars? I'm using ActivePerl on Win2k.
In reply to Eight bit character (non-ASCII) conversion by snax