I think the old node I linked to above won't really help that much. I tried pasting your sample data into a file on my Mac and opened it with TextEdit, which showed me some strings of (I'm guessing) Chinese and Japanese, an accented vowel and a smiley-face character. Using TextEdit to export this as HTML and then transliterating to hex code points (cf. tlu -- TransLiterate Unicode), I got:

\x{5e83}\x{544a}\x{63b2}\x{8f09}
(hyphen)
\x{30d3}\x{30b8}\x{30cd}\x{30b9}
(space)
\x{30bd}\x{30ea}\x{30e5}\x{30fc}\x{30b7}\x{30e7}\x{30f3} 
(space)
S\x{00fc}mfin special \x{263a} !

Those come, respectively, from the original strings:

\'8d\'4c\'8d\'90\'8c\'66\'8d\'da
(hyphen)
\'83\'72\'83\'57\'83\'6c\'83\'58
(space)
\'83\'5c\'83\'8a\'83\'85\'81\'5b\'83\'56\'83\'87\'83\'93
(space)
S\'fcmfin special \uc0\u9786  !

Only the last string there makes any sense to me ("\'fc" equates to the u-umlaut, and 9786 (decimal) = \x{263a}). As for the other three lines, it seems like there's some sort of consistent arithmetic going on, which must be cleverly triggered by those ~~"\ulc2"~~ "\f0" flags, but the nature of the mapping is far from self-evident. (That is, assuming my experiment with TextEdit actually reproduced the originally intended characters.)
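For what it's worth, the two escapes in that last string can be checked with a few lines of Perl. This is only a sketch: treating "\'fc" as a cp1252 byte is my assumption (the RTF header would say which code page really applies), and the variable names are just for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

binmode STDOUT, ':encoding(UTF-8)';

# \u9786 carries a decimal code point; chr() turns it into the character.
my $smiley = chr(9786);
printf "\\u9786 -> U+%04X (%s)\n", ord($smiley), $smiley;

# \'fc is one raw byte. Decoding it as cp1252 is an assumption here.
my $u_umlaut = decode('cp1252', pack('H2', 'fc'));
printf "\\'fc   -> U+%04X (%s)\n", ord($u_umlaut), $u_umlaut;
```

That reports U+263A for the smiley and U+00FC for the u-umlaut, matching what TextEdit showed.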

~~In fact, I suppose it would be foolish of me to take this any further until there's some confirmation (or correction) of the characters I got.~~ Actually, it seems pretty clear what's going on...

UPDATE: It turns out the key is in the "\fonttbl" stuff: "\f0" refers to a font called "HiraMinProN-W3", which is based on one of the following encodings (not sure which): MacJapanese or cp932 (the two are probably "equivalent").
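If you want to see whether the two candidate encodings agree on your data, something like the following might do. It is a rough check rather than proof of equivalence: it only looks at the bytes of the first sample string, and it assumes your Encode installation includes Encode::JP (which provides both "cp932" and "MacJapanese").

```perl
use strict;
use warnings;
use Encode qw(decode);

# Bytes of the first sample string: \'8d\'4c\'8d\'90\'8c\'66\'8d\'da
my $bytes = pack('H*', '8d4c8d908c668dda');

# Decode the same bytes under each candidate and list the code points.
for my $enc ('cp932', 'MacJapanese') {
    my $text = decode($enc, $bytes);
    my $hex  = join ' ', map { sprintf 'U+%04X', ord($_) } split //, $text;
    print "$enc: $hex\n";
}
```

If both lines print the same code points, the choice between the two doesn't matter for that string; other characters may still differ.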

You can get the mapping of byte pairs like "\'83\'5c" into Unicode by joining each pair into a 16-bit word (0x835c), and then using Encode -- e.g.:

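use strict;
use warnings;
use Encode;

binmode STDOUT, ':encoding(UTF-8)';  # emit decoded characters as UTF-8

my $in = "\\'8d\\'4c\\'8d\\'90\\'8c\\'66\\'8d\\'da";
# collect the two hex digits from each \'xx escape
my $string = join( "", ( $in =~ /\\'([0-9a-f]{2})/g ));
# pack the hex digits into raw bytes, then decode those bytes as cp932
my $out = decode("cp932", pack( "H*", $string));

print "$in --> $string --> $out\n";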

(Another update: alas, I'm not sure how to work this sort of thing in with the other stuff you need to do in order to extract all the RTF content coherently...)
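One possible starting point is a small routine that walks a single text run and handles the three things seen in the sample: runs of "\'xx" bytes, "\uN" escapes, and literal characters. This is a sketch under several assumptions, not a real RTF parser: rtf_run_to_text is a name I made up, the caller has to supply the encoding for the run's font (say, cp932 for "\f0"), and it assumes "\uc0" is in effect so that no fallback characters follow "\uN". It knows nothing about groups, "\fonttbl", or font switches.

```perl
use strict;
use warnings;
use Encode qw(decode);

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: decode one RTF text run, given its font's encoding.
sub rtf_run_to_text {
    my ($run, $encoding) = @_;
    my $out = '';
    while (length $run) {
        if ($run =~ s/\A((?:\\'[0-9a-fA-F]{2})+)//) {
            # A run of \'xx escapes: gather the hex digits, pack, decode.
            my $escapes = $1;
            my $hex = join '', $escapes =~ /\\'([0-9a-fA-F]{2})/g;
            $out .= decode($encoding, pack('H*', $hex));
        }
        elsif ($run =~ s/\A\\u(-?\d+) ?//) {
            # \uN is a signed 16-bit decimal; negative values wrap by 65536.
            my $n = $1;
            $n += 65536 if $n < 0;
            $out .= chr($n);
        }
        elsif ($run =~ s/\A\\[a-z]+-?\d* ?//) {
            # Any other control word (e.g. \uc0) is skipped in this sketch.
        }
        else {
            # Literal character: copy it through unchanged.
            $out .= substr($run, 0, 1, '');
        }
    }
    return $out;
}

print rtf_run_to_text(q{\'83\'72\'83\'57\'83\'6c\'83\'58}, 'cp932'), "\n";
print rtf_run_to_text(q{S\'fcmfin special \uc0\u9786  !}, 'cp1252'), "\n";
```

The second call uses cp1252 for the "\'fc" byte, which is again my guess at the right code page for that run. Working out which encoding belongs to which font would still have to come from parsing the "\fonttbl" group.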

In reply to Re: RTF'ing unicode by graff
in thread RTF'ing unicode by Anonymous Monk
