in reply to RTF'ing unicode

I think the old node I linked to above won't really help that much. I tried pasting your sample data into a file on my Mac and opened it with TextEdit, which showed me some strings of (I'm guessing) Chinese and Japanese, an accented vowel and a smiley-face character. Using TextEdit to export this as HTML and then transliterating to hex code points (cf. tlu -- TransLiterate Unicode), I got:
\x{5e83}\x{544a}\x{63b2}\x{8f09} (hyphen) \x{30d3}\x{30b8}\x{30cd}\x{30b9} (space) \x{30bd}\x{30ea}\x{30e5}\x{30fc}\x{30b7}\x{30e7}\x{30f3} (space) S\x{00fc}mfin special \x{263a} !
Those come, respectively, from the original strings:
\'8d\'4c\'8d\'90\'8c\'66\'8d\'da (hyphen) \'83\'72\'83\'57\'83\'6c\'83\'58 (space) \'83\'5c\'83\'8a\'83\'85\'81\'5b\'83\'56\'83\'87\'83\'93 (space) S\'fcmfin special \uc0\u9786 !
Only the last string there makes any sense to me ("\'fc" equates to the u-umlaut, and 9786 (decimal) = \x{263a}). As for the other three strings, it seems like there's some sort of consistent arithmetic going on, which must be cleverly triggered by those "\ulc2" "\f0" flags, but the nature of the mapping is far from self-evident. (That is, assuming my experiment with TextEdit actually reproduced the originally intended characters.)
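For anyone who wants to check that bit of arithmetic, here is a throwaway sketch (not part of any RTF tooling). It assumes the "\'fc" byte sits in a Latin-1-style code page, which is my guess rather than something the RTF header confirmed:

```perl
use strict;
use warnings;

# \u9786 in RTF carries a decimal code point; render it as hex
my $hex_of_u = sprintf( "%04x", 9786 );

# \'fc is a single byte; ASSUMING Latin-1 / cp1252, that byte is u-umlaut
my $umlaut = chr( hex("fc") );

printf "9786 = U+%s, 0xfc = U+%04X\n", uc $hex_of_u, ord $umlaut;
```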

In fact, I suppose it would be foolish of me to take this any further until there's some confirmation (or correction) of the characters I got. Actually, it seems pretty clear what's going on...

UPDATE: It turns out the key is in the "\fonttbl" stuff: "\f0" refers to a font called "HiraMinProN-W3", which is based on one of the following encodings (not sure which): MacJapanese or cp932 (the two are probably "equivalent").

You can map byte pairs like "\'83\'5c" to Unicode by joining them into 16-bit words (0x835c) and then decoding with Encode -- e.g.:

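    use strict;
    use warnings;
    use Encode;

    binmode STDOUT, ":encoding(UTF-8)";   # so the decoded characters print correctly

    my $in = "\\'8d\\'4c\\'8d\\'90\\'8c\\'66\\'8d\\'da";
    # collect the two hex digits from each \'xx escape
    my $string = join( "", ( $in =~ /\\'([0-9a-f]{2})/gi ));
    # pack the hex digits into raw bytes, then decode those bytes as cp932
    my $out = decode("cp932", pack( "H*", $string ));
    print "$in --> $string --> $out\n";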
(Another update: alas, I'm not sure how to work this sort of thing in with the other stuff you need to do in order to extract all the RTF content coherently...)
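
For what it's worth, here is a rough sketch of how both kinds of escape might be handled in one pass over a single text run. It is only a starting point under some loud assumptions: decode_rtf_run is a made-up helper (not from any RTF module), the whole run is assumed to use one font whose encoding is cp932, every byte of a double-byte character is assumed to be written as a \'xx escape (as in the TextEdit sample), and "\ucN" is assumed to be 0 so that no fallback characters follow a "\uN". It does not look at "\fonttbl" or font switches at all, so a stretch like "S\'fcmfin" in a different font would need its own encoding.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode the \'xx and \uN escapes in one RTF text run.
# ASSUMES a single encoding for the whole run (cp932 by default) and \uc0.
sub decode_rtf_run {
    my ( $run, $encoding ) = @_;
    $encoding ||= "cp932";

    # drop \ucN control words (plus the optional delimiting space)
    $run =~ s/\\uc\d+ ?//g;

    # \uN is a signed 16-bit decimal code point; negative values wrap around
    $run =~ s/\\u(-?\d+) ?/chr( $1 < 0 ? $1 + 65536 : $1 )/ge;

    # each unbroken run of \'xx escapes is a byte string in the font's encoding
    $run =~ s{((?:\\'[0-9a-fA-F]{2})+)}{
        ( my $hex = $1 ) =~ s/\\'//g;
        decode( $encoding, pack( "H*", $hex ) );
    }ge;

    return $run;
}

binmode STDOUT, ":encoding(UTF-8)";
my $decoded = decode_rtf_run("\\'83\\'72\\'83\\'57\\'83\\'6c\\'83\\'58 \\uc0\\u9786 !");
print "$decoded\n";
```

Note that the space right after "\u9786" is treated as the control-word delimiter and swallowed, which is how I read the RTF rules; if your reader keeps it, drop the " ?" from that pattern.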