Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I've been digging around through the various perl modules related to parsing RTF files and am finding that so far they all fail to correctly parse unicode text from the RTF file. I've also tried catdoc, antiword, RTF::Tokenizer, and rtf2text (basically a wrapper around the RTF::Parser stuff).

Have any monks out there already tread this lonely and desolate path? Are there any Perl modules or linux CLI tools out there that can do this?

An example rtf file would be:

{\rtf1\ansi\ansicpg1252\cocoartf1038\cocoasubrtf250 {\fonttbl\f0\fnil\fcharset128 HiraMinProN-W3;\f1\froman\fcharset0 Times-Roman;\f2\fswiss\fcharset0 Helvetica; } {\colortbl;\red255\green255\blue255;\red0\green22\blue231;} \margl1440\margr1440\vieww9000\viewh8400\viewkind0 \deftab720 \pard\pardeftab720\ql\qnatural {\field{\*\fldinst{HYPERLINK "http://www.google.co.jp/intl/ja/ads/"}}{\fldrslt \f0\fs20 \cf2 \ul \ulc2 \'8d\'4c\'8d\'90\'8c\'66\'8d\'da}} \f1\fs20 - {\field{\*\fldinst{HYPERLINK "http://www.google.co.jp/intl/ja/services/"}}{\fldrslt \f0 \cf2 \ul \ulc2 \'83\'72\'83\'57\'83\'6c\'83\'58 \f1 \f0 \'83\'5c\'83\'8a\'83\'85\'81\'5b\'83\'56\'83\'87\'83\'93}} \f0 \cf2 \ul \ulc2 \f2\fs24 \cf0 \ulnone S\'fcmfin special \uc0\u9786 !}
Where \'83\'72\'83\'57\'83\'6c\'83\'58 and \'83\'5c\'83\'8a\'83\'85\'81\'5b\'83\'56\'83\'87\'83\'93 are the escaped unicode strings that nothing I've tried seems to handle.

Can anyone help this poor mendicant, reduced to tears by the code comments within RTF::Parser?

Edit by GrandFather - replaced pre tags with code tags, to prevent distortion of site layout and allow code extraction.

Replies are listed 'Best First'.
Re: RTF'ing unicode
by graff (Chancellor) on May 05, 2010 at 02:57 UTC
    I think the old node I linked to above won't really help that much. I tried pasting your sample data into a file on my Mac and opened it with TextEdit, which showed me some strings of (I'm guessing) Chinese and Japanese, an accented vowel and a smiley-face character. Using TextEdit to export this as HTML and then transliterating to hex code points (cf. tlu -- TransLiterate Unicode), I got:
    \x{5e83}\x{544a}\x{63b2}\x{8f09} (hyphen) \x{30d3}\x{30b8}\x{30cd}\x{30b9} (space) \x{30bd}\x{30ea}\x{30e5}\x{30fc}\x{30b7}\x{30e7}\x{30f3} (space) S\x{00fc}mfin special \x{263a} !
    Those come, respectively from the original strings:
    \'8d\'4c\'8d\'90\'8c\'66\'8d\'da (hyphen) \'83\'72\'83\'57\'83\'6c\'83\'58 (space) \'83\'5c\'83\'8a\'83\'85\'81\'5b\'83\'56\'83\'87\'83\'93 (space) S\'fcmfin special \uc0\u9786 !
    Only the last string there makes any sense to me ("\'fc" equates to the u-umlaut, and 9786 (decimal) = \x{263a}). As for the other three strings, it seems like there's some sort of consistent arithmetic going on, which must be cleverly triggered by those "\ulc2" "\f0" flags, but the nature of the mapping is far from self-evident. (That is, assuming my experiment with TextEdit actually reproduced the originally intended characters.)
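    The two escapes in that last string can be checked by hand. This is only a minimal sketch, assuming that \'fc is a single byte in the document's default code page (cp1252, going by \ansicpg1252 in the header) and that \uN carries a non-negative decimal code point:

```perl
use strict;
use warnings;
use Encode qw(decode);

# \'fc is one byte; the header's \ansicpg1252 suggests decoding it as cp1252.
my $u_umlaut = decode('cp1252', chr(0xfc));

# \u9786 carries a decimal Unicode code point (assumed non-negative here).
my $smiley = chr(9786);

# Print the code points in hex, so no wide characters reach STDOUT.
printf "U+%04X U+%04X\n", ord($u_umlaut), ord($smiley);
```

    Both values agree with the hex code points listed above for the u-umlaut and the smiley.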

    In fact, I suppose it would be foolish of me to take this any further until there's some confirmation (or correction) of the characters I got. Actually, it seems pretty clear what's going on...

    UPDATE: It turns out the key is in the "\fonttbl" stuff: "\f0" refers to a font called "HiraMinProN-W3", which is based on one of the following encodings (not sure which): MacJapanese or cp932 (the two are probably "equivalent").
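    One way to make that lookup mechanical is to read the \fcharsetN value for each font out of the \fonttbl group. What follows is a rough sketch, not a real RTF parser: it assumes the usual RTF charset numbers (0 for ANSI, 128 for Shift-JIS), falls back to cp1252 for anything else, and the %charset_to_encoding table and the regex are made up for illustration:

```perl
use strict;
use warnings;

# Font table copied from the sample document.
my $fonttbl = q({\fonttbl\f0\fnil\fcharset128 HiraMinProN-W3;)
            . q(\f1\froman\fcharset0 Times-Roman;)
            . q(\f2\fswiss\fcharset0 Helvetica; });

# Partial, assumed mapping from \fcharsetN to Encode encoding names.
my %charset_to_encoding = (0 => 'cp1252', 128 => 'cp932');

# Map each font number to the encoding its \'xx bytes should be decoded with.
my %font_encoding;
while ($fonttbl =~ /\\f(\d+)\\[a-z]+\\fcharset(\d+)/g) {
    $font_encoding{$1} = $charset_to_encoding{$2} // 'cp1252';
}

print "f$_ => $font_encoding{$_}\n" for sort keys %font_encoding;
```

    This would also explain the cp1252 in the header: \ansicpg only names the document default, and a font's \fcharset can override it for the text set in that font.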

    You can get the mapping of byte pairs like "\'83\'5c" into unicode by turning them into 16-bit words (0x835c), and then using Encode -- e.g.:

    use Encode;
    my $in = "\\'8d\\'4c\\'8d\\'90\\'8c\\'66\\'8d\\'da";
    # collect the hex digit pairs, dropping the \' prefixes
    my $string = join( "", ( $in =~ /\\'([0-9a-f]{2})/g ));
    # pack the hex digits into raw bytes and decode them as cp932
    my $out = decode("cp932", pack( "H*", $string));
    print "$in --> $string --> $out\n"; # assumes STDOUT is set for utf8 output
    (Another update: alas, I'm not sure how to work this sort of thing in with the other stuff you need to do in order to extract all the RTF content coherently...)
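    One possible way to fold this into text extraction is to walk the RTF body token by token, remember which font is in force, and decode each run of bytes with that font's encoding. The sketch below is heavily simplified and only meant to show the idea: rtf_fragment_to_text is a hypothetical helper, the %font_encoding map is assumed to have been derived from the \fonttbl, and it handles only a flat fragment (no nested groups, no \fldinst destinations, no escaped \\ \{ \}, and \uc0 as in the sample, so no fallback characters to skip after \uN).

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumed font => encoding map, as derived from the sample's \fonttbl.
my %font_encoding = (0 => 'cp932', 1 => 'cp1252', 2 => 'cp1252');

# Hypothetical helper: turn a flat RTF body fragment into a character string.
sub rtf_fragment_to_text {
    my ($rtf) = @_;
    my ($encoding, $bytes, $text) = ('cp1252', '', '');

    # Decode the bytes collected so far with the encoding currently in force.
    my $flush = sub {
        $text .= decode($encoding, $bytes) if length $bytes;
        $bytes = '';
    };

    # Alternatives, in order: \'xx byte, \uN code point, \fN font switch,
    # any other control word (ignored), plain text, stray brace/backslash.
    while ($rtf =~ /\G(?: \\'([0-9a-fA-F]{2})
                        | \\u(-?\d+)[ ]?
                        | \\f(\d+)[ ]?
                        | \\[a-z]+-?\d*[ ]?
                        | ([^\\{}]+)
                        | [\\{}]
                      )/gcx) {
        if (defined $1) {
            $bytes .= chr(hex($1));
        }
        elsif (defined $2) {
            # \uN is a signed 16-bit value; negative numbers wrap around.
            $flush->();
            $text .= chr($2 < 0 ? $2 + 65536 : $2);
        }
        elsif (defined $3) {
            $flush->();
            $encoding = $font_encoding{$3} // 'cp1252';
        }
        elsif (defined $4) {
            # Plain text is ASCII, so it can join the pending byte run;
            # line breaks carry no meaning in RTF and are dropped.
            (my $plain = $4) =~ tr/\r\n//d;
            $bytes .= $plain;
        }
    }
    $flush->();
    return $text;
}

my $fragment = q(\f0 \'83\'72\'83\'57\'83\'6c\'83\'58 \f2 S\'fcmfin special \uc0\u9786 !);
my $text = rtf_fragment_to_text($fragment);

binmode STDOUT, ':encoding(UTF-8)';
print "$text\n";
```

    Keeping plain text in the same byte buffer as the \'xx escapes means a double-byte character whose second byte happens to be written as a literal ASCII character would still decode as a pair. A real extractor would also need a group stack so that font changes end at the closing brace.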
Re: RTF'ing unicode
by choroba (Cardinal) on May 04, 2010 at 23:23 UTC
    rtf2html can convert those to html entities. But I do not know whether anything can make another step and convert html to something else :-)

      It doesn't seem to for me. I get

      <u>&#141;&#76;&#141;&#144;OE&#102;&#141;&#218;</u> - <u>f&#114;f&#87;f&#108;f&#88; f&#92;fSf...&#129;&#91;f&#86;f++f``</u><u> </u>S&#252;mfin special !</body>

      which is the html entity version of what rtf2text spits out:
      LOEfÚ - frfWflfX f\fSf...[fVf++f`` Sümfin special !
        Oh, you are right. I see a slightly different output:
        <u>&#141;&#76;&#141;&#144;&#140;&#102;&#141;&#218;</u> - <u>&#131;&#114;&#131;&#87;&#131;&#108;&#131;&#88; &#131;&#92;&#131;&#138;&#131;&#133;&#129;&#91;&#131;&#86;&#131;&#135;&#131;&#147;</u><u> </u>S&#252;mfin special !
        but it is probably not utf8 either. Why does the RTF header state it is cp 1252?
Re: RTF'ing unicode
by graff (Chancellor) on May 05, 2010 at 02:15 UTC