I've been digging through the various Perl modules for parsing RTF files, and so far they all fail to correctly parse Unicode text from the RTF file. I've also tried catdoc, antiword, RTF::Tokenizer, and rtf2text (basically a wrapper around the RTF::Parser stuff).
Have any monks out there already trodden this lonely and desolate path? Are there any Perl modules or Linux CLI tools that can do this?
An example RTF file would be:
{\rtf1\ansi\ansicpg1252\cocoartf1038\cocoasubrtf250
{\fonttbl\f0\fnil\fcharset128 HiraMinProN-W3;\f1\froman\fcharset0 Times-Roman;\f2\fswiss\fcharset0 Helvetica;
}
{\colortbl;\red255\green255\blue255;\red0\green22\blue231;}
\margl1440\margr1440\vieww9000\viewh8400\viewkind0
\deftab720
\pard\pardeftab720\ql\qnatural
{\field{\*\fldinst{HYPERLINK "http://www.google.co.jp/intl/ja/ads/"}}{\fldrslt \f0\fs20 \cf2 \ul \ulc2 \'8d\'4c\'8d\'90\'8c\'66\'8d\'da}}
\f1\fs20 - {\field{\*\fldinst{HYPERLINK "http://www.google.co.jp/intl/ja/services/"}}{\fldrslt \f0 \cf2 \ul \ulc2 \'83\'72\'83\'57\'83\'6c\'83\'58 \f1 \f0 \'83\'5c\'83\'8a\'83\'85\'81\'5b\'83\'56\'83\'87\'83\'93}}
\f0 \cf2 \ul \ulc2 \f2\fs24 \cf0 \ulnone S\'fcmfin special \uc0\u9786 !}
Where \'83\'72\'83\'57\'83\'6c\'83\'58 and \'83\'5c\'83\'8a\'83\'85\'81\'5b\'83\'56\'83\'87\'83\'93 are the escaped Unicode strings that nothing I've tried seems to handle.
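For what it's worth, the \'hh sequences in the sample do not look like Unicode escapes. Font \f0 is declared with \fcharset128, which is the Shift-JIS character set, so each pair of \'hh bytes would be one Shift-JIS character, and \u9786 is the only true Unicode escape in the file. Below is a minimal sketch of that decoding, written in Python rather than Perl and not based on any existing module's API. The names FCHARSET_TO_CODEC, font_codecs, decode_hex_run and decode_u_escape are made up for illustration, the charset map is a deliberately partial assumption, and the font-table regex only covers entries shaped like the ones in this sample.

```python
import re

# Partial, assumed map from RTF \fcharsetN values to Python codec names.
# Only the two charsets that occur in the sample file are listed.
FCHARSET_TO_CODEC = {0: "cp1252", 128: "cp932"}

# Matches font table entries shaped like \f0\fnil\fcharset128 (sample-specific).
FONT_ENTRY = re.compile(r"\\f(\d+)\\[a-z]+\\fcharset(\d+)")

# Matches a single \'hh escape and captures the two hex digits.
HEX_ESCAPE = re.compile(r"\\'([0-9a-fA-F]{2})")


def font_codecs(rtf):
    # Build {font number: codec name}, falling back to cp1252 when unknown.
    return {
        int(font): FCHARSET_TO_CODEC.get(int(charset), "cp1252")
        for font, charset in FONT_ENTRY.findall(rtf)
    }


def decode_hex_run(run, fcharset):
    # Collect the raw bytes of a run of \'hh escapes, then decode them
    # together so that multi-byte characters are reassembled correctly.
    raw = bytes(int(h, 16) for h in HEX_ESCAPE.findall(run))
    return raw.decode(FCHARSET_TO_CODEC[fcharset])


def decode_u_escape(n):
    # \uN carries a signed 16-bit decimal value; negative values wrap around.
    # Surrogate pairs are not recombined in this sketch.
    return chr(n & 0xFFFF)


if __name__ == "__main__":
    print(decode_hex_run(r"\'83\'72\'83\'57\'83\'6c\'83\'58", 128))
    print(decode_hex_run(r"\'83\'5c\'83\'8a\'83\'85\'81\'5b\'83\'56\'83\'87\'83\'93", 128))
    print("S" + decode_hex_run(r"\'fc", 0) + "mfin", decode_u_escape(9786))
```

Run against the two strings from the sample, this gives ビジネス and ソリューション, and the last line gives Sümfin ☺. A real converter would also have to track the current \fN while walking the groups and honor \ucN when skipping fallback characters, which this sketch leaves out.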
Can anyone help this poor mendicant, reduced to tears by the code comments within RTF::Parser?
Edit by GrandFather - replaced pre tags with code tags, to prevent distortion of site layout and allow code extraction.
In reply to RTF'ing unicode by Anonymous Monk