Those come, respectively, from the original strings:

\x{5e83}\x{544a}\x{63b2}\x{8f09} (hyphen) \x{30d3}\x{30b8}\x{30cd}\x{30b9} (space) \x{30bd}\x{30ea}\x{30e5}\x{30fc}\x{30b7}\x{30e7}\x{30f3} (space) S\x{00fc}mfin special \x{263a} !
Only the last string there makes any sense to me ("\'fc" equates to the u-umlaut, and 9786 (decimal) = \x{263a}). As for the other three lines, it seems like there's some sort of consistent arithmetic going on, which must be cleverly triggered by those

\'8d\'4c\'8d\'90\'8c\'66\'8d\'da (hyphen) \'83\'72\'83\'57\'83\'6c\'83\'58 (space) \'83\'5c\'83\'8a\'83\'85\'81\'5b\'83\'56\'83\'87\'83\'93 (space) S\'fcmfin special \uc0\u9786 !
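For what it's worth, the two conversions in that last string can be checked directly. This is only a small sketch: treating "\'fc" as a Latin-1 byte is my assumption (the RTF header would have to confirm the actual code page), and the variable names are just for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# \u9786 carries a decimal code point; show it in hex.
my $hex = sprintf "%04x", 9786;

# \'fc is one byte; decoding it as Latin-1 is an assumption here.
my $char = decode("iso-8859-1", "\xfc");
my $cp   = sprintf "%04x", ord $char;

print "\\u9786 -> U+$hex, \\'fc -> U+$cp\n";
```

The first value should match \x{263a} and the second \x{00fc}, which is why that line is the only one that reads off without any table lookup.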
In fact, I suppose it would be foolish of me to take this any further until there's some confirmation (or correction) of the characters I got. Actually, it seems pretty clear what's going on...
UPDATE: It turns out the key is in the "\fonttbl" stuff: "\f0" refers to a font called "HiraMinProN-W3", which is based on one of the following encodings (not sure which): MacJapanese or cp932 (the two are probably "equivalent").
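One way to see whether the two candidate encodings agree, at least for the bytes in this sample, is to decode the same bytes both ways and compare. This is a sketch only: it assumes your Encode installation provides both "cp932" and "MacJapanese" (they normally come with Encode::JP), and the hex strings are just the \'hh pairs from the three Japanese strings above with the escapes stripped.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hex pairs taken from the \'hh escapes of the three Japanese strings.
my @samples = (
    "8d4c8d908c668dda",
    "83728357836c8358",
    "835c838a8385815b835683878393",
);

for my $hex (@samples) {
    # Build a fresh byte string for each decode call.
    my $cp932 = decode("cp932",       pack("H*", $hex));
    my $mac   = decode("MacJapanese", pack("H*", $hex));
    printf "%s: %s\n", $hex, ($cp932 eq $mac ? "same" : "different");
}
```

Agreement on these samples would not prove the encodings are equivalent in general; it would only show that the choice doesn't matter for this particular text.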
You can get the mapping of byte pairs like "\'83\'5c" into unicode by turning them into 16-bit words (0x835c), and then using Encode -- e.g.:
    use strict;
    use warnings;
    use Encode;
    my $in = "\\'8d\\'4c\\'8d\\'90\\'8c\\'66\\'8d\\'da";
    # collect the two hex digits from each \'hh escape into one string
    my $string = join( "", ( $in =~ /\\'([0-9a-f]{2})/g ));
    # pack the hex digits into raw bytes, then decode those bytes as cp932
    my $out = decode("cp932", pack( "H*", $string));
    print "$in --> $string --> $out\n";  # assumes STDOUT is set for utf8 output

(Another update: alas, I'm not sure how to work this sort of thing in with the other stuff you need to do in order to extract all the RTF content coherently...)
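One possible way to fold this into a larger pass over the RTF text is a helper that handles both kinds of escape at once. What follows is a rough sketch, not a real RTF parser: the sub names are made up, the caller has to supply the encoding for the current font (I haven't worked out how to pull that from "\fonttbl"), and it only copes with "\uc0"-style text where no fallback characters follow "\uN".

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode one run of \'hh escapes as bytes in $enc.
sub decode_hex_run {
    my ($run, $enc) = @_;
    my $hex = join "", $run =~ /\\'([0-9a-fA-F]{2})/g;
    return decode($enc, pack("H*", $hex));
}

# Hypothetical helper: replace \'hh runs and \uN escapes in an RTF fragment.
# Assumes one encoding for the whole fragment and no \uN fallback characters.
sub rtf_escapes_to_unicode {
    my ($rtf, $enc) = @_;
    # Drop \ucN control words, plus the single space that may delimit them.
    $rtf =~ s/\\uc\d+ ?//g;
    # \uN holds a decimal code point; values above 32767 appear as negatives.
    $rtf =~ s/\\u(-?\d+) ?/chr($1 < 0 ? $1 + 65536 : $1)/ge;
    # Decode each run of consecutive \'hh escapes as a unit, so that
    # two-byte characters are not split.
    $rtf =~ s/((?:\\'[0-9a-fA-F]{2})+)/decode_hex_run($1, $enc)/ge;
    return $rtf;
}

my $in  = "\\'8d\\'4c\\'8d\\'90\\'8c\\'66\\'8d\\'da";
my $out = rtf_escapes_to_unicode($in, "cp932");
binmode STDOUT, ":encoding(UTF-8)";
print "$out\n";
```

Decoding whole runs rather than single escapes matters because a trail byte such as "\'5c" means nothing on its own. A fuller version would need to track font switches ("\f0", "\f1") and change the encoding accordingly.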
In reply to Re: RTF'ing unicode
by graff
in thread RTF'ing unicode
by Anonymous Monk