Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
I've been digging through the various Perl modules for parsing RTF files, and so far they all fail to correctly parse Unicode text from the RTF file. I've also tried catdoc, antiword, RTF::Tokenizer, and rtf2text (basically a wrapper around the RTF::Parser stuff).
Have any monks out there already trodden this lonely and desolate path? Are there any Perl modules or Linux CLI tools out there that can do this?
An example RTF file would be:

{\rtf1\ansi\ansicpg1252\cocoartf1038\cocoasubrtf250
{\fonttbl\f0\fnil\fcharset128 HiraMinProN-W3;\f1\froman\fcharset0 Times-Roman;\f2\fswiss\fcharset0 Helvetica;
}
{\colortbl;\red255\green255\blue255;\red0\green22\blue231;}
\margl1440\margr1440\vieww9000\viewh8400\viewkind0
\deftab720
\pard\pardeftab720\ql\qnatural
{\field{\*\fldinst{HYPERLINK "http://www.google.co.jp/intl/ja/ads/"}}{\fldrslt
\f0\fs20 \cf2 \ul \ulc2 \'8d\'4c\'8d\'90\'8c\'66\'8d\'da}}
\f1\fs20 - {\field{\*\fldinst{HYPERLINK "http://www.google.co.jp/intl/ja/services/"}}{\fldrslt
\f0 \cf2 \ul \ulc2 \'83\'72\'83\'57\'83\'6c\'83\'58 \f1 \f0 \'83\'5c\'83\'8a\'83\'85\'81\'5b\'83\'56\'83\'87\'83\'93}}
\f0 \cf2 \ul \ulc2
\f2\fs24 \cf0 \ulnone S\'fcmfin special \uc0\u9786 !}

Where \'83\'72\'83\'57\'83\'6c\'83\'58 and \'83\'5c\'83\'8a\'83\'85\'81\'5b\'83\'56\'83\'87\'83\'93 are the escaped unicode strings that nothing I've tried seems to handle.
Can anyone help this poor mendicant, reduced to tears by the code comments within RTF::Parser?
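One possible reading of the sample, offered as a sketch rather than a confirmed answer: the runs in question are set in font \f0, which the font table declares with \fcharset128. That value conventionally denotes Shift-JIS, so each \'hh escape may be a raw byte in that encoding rather than a Unicode code point. Under that assumption, the bytes can be collected and decoded with the core Encode module. The helper name decode_rtf_hex_run and the choice of 'cp932' as the encoding name are illustrative assumptions, not part of any RTF module's API.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn a run of RTF \'hh escapes into a Perl
# character string, treating each escape as one byte in $encoding.
sub decode_rtf_hex_run {
    my ($run, $encoding) = @_;

    # Pull out every two-digit hex value and rebuild the raw byte string.
    my $bytes = join '', map { chr(hex($_)) } $run =~ /\\'([0-9a-fA-F]{2})/g;

    # Decode the bytes using the encoding assumed for the active font.
    return decode($encoding, $bytes);
}

# Assumption: \fcharset128 (font \f0) corresponds to Shift-JIS / cp932.
my $japanese = decode_rtf_hex_run(q{\'83\'72\'83\'57\'83\'6c\'83\'58}, 'cp932');

# Assumption: \ansicpg1252 applies to the \fcharset0 fonts.
my $latin = decode_rtf_hex_run(q{\'fc}, 'cp1252');

binmode STDOUT, ':encoding(UTF-8)';
print "$japanese\n";
print "S${latin}mfin\n";
```

If the Shift-JIS assumption holds, the first run decodes to the katakana word ビジネス and the \'fc in the last line to ü. A full solution would still need to track which font is active for each run and handle the separate \uN escapes, which this sketch does not attempt.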
Edit by GrandFather - replaced pre tags with code tags, to prevent distortion of site layout and allow code extraction.
Replies are listed 'Best First'.
Re: RTF'ing unicode
by graff (Chancellor) on May 05, 2010 at 02:57 UTC
Re: RTF'ing unicode
by choroba (Cardinal) on May 04, 2010 at 23:23 UTC
by Anonymous Monk on May 04, 2010 at 23:37 UTC
by choroba (Cardinal) on May 04, 2010 at 23:45 UTC
by Anonymous Monk on May 05, 2010 at 00:01 UTC
Re: RTF'ing unicode
by graff (Chancellor) on May 05, 2010 at 02:15 UTC