Re: RTF'ing unicode

I think the old node I linked to above won't really help that much. I pasted your sample data into a file on my Mac and opened it with TextEdit, which showed me some strings of (I'm guessing) Chinese and Japanese, an accented vowel and a smiley-face character. Using TextEdit to export this as HTML and then transliterating to hex code points (cf. tlu -- TransLiterate Unicode), I got:

\x{5e83}\x{544a}\x{63b2}\x{8f09}
(hyphen)
\x{30d3}\x{30b8}\x{30cd}\x{30b9}
(space)
\x{30bd}\x{30ea}\x{30e5}\x{30fc}\x{30b7}\x{30e7}\x{30f3} 
(space)
S\x{00fc}mfin special \x{263a} !

Those come, respectively, from the original strings:

\'8d\'4c\'8d\'90\'8c\'66\'8d\'da
(hyphen)
\'83\'72\'83\'57\'83\'6c\'83\'58
(space)
\'83\'5c\'83\'8a\'83\'85\'81\'5b\'83\'56\'83\'87\'83\'93
(space)
S\'fcmfin special \uc0\u9786  !

Only the last string there makes any sense to me ("\'fc" equates to the u-umlaut, and 9786 (decimal) = \x{263a}). As for the other three lines, it seems like there's some sort of consistent arithmetic going on, which must be cleverly triggered by those ~~"\ulc2"~~ "\f0" flags, but the nature of the mapping is far from self-evident. (That is, assuming my experiment with TextEdit actually reproduced the originally intended characters.)
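For what it's worth, the two escapes in that last string can be checked directly. This is only a minimal sketch: it assumes the "\'fc" byte belongs to a Latin font whose code page is cp1252 (the sample doesn't say so explicitly), and it treats "\u9786" as a plain decimal code point. As far as I know, the "\uc0" in front of it just says that no fallback bytes follow the "\u" escape.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: the \'fc byte comes from a font using code page cp1252.
my $umlaut = decode('cp1252', pack('H2', 'fc'));

# \u9786 carries a decimal code point; chr() turns it into a character.
my $smiley = chr(9786);

# Print both as hex code points to compare with the listing above.
printf "U+%04X U+%04X\n", ord($umlaut), ord($smiley);
```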

~~In fact, I suppose it would be foolish of me to take this any further until there's some confirmation (or correction) of the characters I got.~~ Actually, it seems pretty clear what's going on...

UPDATE: It turns out the key is in the "\fonttbl" stuff: "\f0" refers to a font called "HiraMinProN-W3", which is based on one of the following encodings (not sure which): MacJapanese or cp932 (both are Shift-JIS variants, so the two are probably "equivalent" for these characters).

You can get the mapping of byte pairs like "\'83\'5c" into Unicode by turning them into 16-bit words (0x835c), and then using Encode -- e.g.:

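use strict;
use warnings;
use Encode;

binmode STDOUT, ':encoding(UTF-8)';  # print the decoded characters as utf8

my $in = "\\'8d\\'4c\\'8d\\'90\\'8c\\'66\\'8d\\'da";
# collect the hex digits of every \'xx escape into one string
my $string = join( "", ( $in =~ /\\'([0-9a-f]{2})/gi ));
# pack the hex digits into bytes, then decode those bytes as cp932
my $out = decode("cp932", pack( "H*", $string));

print "$in --> $string --> $out\n";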
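Since I wasn't sure whether the font is really cp932 or MacJapanese, here is a rough way to compare the two on the three sample strings. It's a sketch only: the hex strings are the "\'xx" bytes from the RTF lines above with the escapes stripped, as_code_points is a throwaway helper I made up for this, and I haven't confirmed that the two encodings agree on every byte pair.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hex bytes from the first three RTF lines, with the \' escapes stripped.
my @samples = (
    '8d4c8d908c668dda',
    '83728357836c8358',
    '835c838a8385815b835683878393',
);

# Throwaway helper: render a decoded string as \x{....} code points.
sub as_code_points {
    my ($text) = @_;
    return join '', map { sprintf '\x{%04x}', ord } split //, $text;
}

for my $hex (@samples) {
    my $cp932 = decode('cp932',       pack('H*', $hex));
    my $mac   = decode('MacJapanese', pack('H*', $hex));
    my $note  = $cp932 eq $mac ? 'same in MacJapanese' : 'differs in MacJapanese';
    print as_code_points($cp932), " ($note)\n";
}
```

The code points it prints can be compared line by line against the TextEdit listing near the top of this post.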

(Another update: alas, I'm not sure how to work this sort of thing in with the other stuff you need to do in order to extract all the RTF content coherently...)
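One possible starting point, and no more than that: a small routine that handles the "\'xx" runs and the "\uN" escapes inside a single piece of text. It's a sketch under some loud assumptions: decode_rtf_escapes is a name I made up (it's not from any RTF module), the caller already knows which encoding goes with the current font, every byte of a double-byte character is written as a "\'xx" escape (as in your sample), and "\uc0" is in effect so no fallback bytes follow a "\uN". It does nothing about the rest of the RTF markup.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn one run of \'xx escapes into decoded text.
sub _hex_run_to_text {
    my ($run, $encoding) = @_;
    my $hex = join '', $run =~ /([0-9a-fA-F]{2})/g;
    return decode($encoding, pack('H*', $hex));
}

# Hypothetical helper: assumes all \'xx runs in $text use $encoding,
# and that \uN escapes have no fallback characters after them (\uc0).
sub decode_rtf_escapes {
    my ($text, $encoding) = @_;

    # Drop \ucN control words, plus the optional delimiting space.
    $text =~ s/\\uc\d+ ?//g;

    # \uN is a signed 16-bit decimal value; negative values wrap around.
    $text =~ s/\\u(-?\d+) ?/chr($1 < 0 ? $1 + 65536 : $1)/ge;

    # Decode each whole run of \'xx escapes at once, so the two bytes
    # of a double-byte character stay together.
    $text =~ s/((?:\\'[0-9a-fA-F]{2})+)/_hex_run_to_text($1, $encoding)/ge;

    return $text;
}

binmode STDOUT, ':encoding(UTF-8)';
print decode_rtf_escapes(q{S\'fcmfin special \uc0\u9786  !}, 'cp1252'), "\n";
print decode_rtf_escapes(q{\'83\'72\'83\'57\'83\'6c\'83\'58}, 'cp932'), "\n";
```

The awkward part that remains is choosing the encoding: you'd have to track which "\fN" is active and look that font up in the "\fonttbl" before calling something like this.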
