Re: Parse::RecDescent grammar for RTF
by Willard B. Trophy (Hermit) on Nov 02, 2002 at 13:50 UTC
|
I don't know of an RTF grammar for Parse::RecDescent. RTF is a bit of a mess, structurally, so parsing isn't trivial. Even major applications write RTF that isn't quite standard.
I, too, have been burnt by the RTF parser on CPAN (RTF::Parser); it nearly did what I wanted, but is exceedingly hard to customize. Basically, if RTF::Parser doesn't do exactly what you want out of the box (and it can; its HTML output is pretty cool), look elsewhere. Thus speaks the voice of bitter experience.
A low-level solution which works for me is RTF::Tokenizer, on which I've based a production system for converting RTF dictionary data to Quark XPress tags. RTF::Tokenizer has its quirks; give me a yell if you need help.
--
$,="\n";foreach(split('',"\3\3\3c>\0>c\177cc\0~c~``\0cc\177cc"))
{$a++;$_=unpack('B8',$_);tr,01,\40#,;$b[$a%6].=$_};print@b,"\n"
Re: Parse::RecDescent grammar for RTF
by PodMaster (Abbot) on Nov 02, 2002 at 15:15 UTC
|
How's your C knowledge?
You can find the RTF specification at http://www.wotsit.org/ (as you can for almost any file format a programmer might need help with), and it includes a sample C reader (Appendix A).
It doesn't look like it'd be too hard to develop a grammar, although it looks like it'd be lots easier to just develop a parser ;)
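To give a rough idea of what "just develop a parser" might start from, here is a minimal hand-rolled tokenizer sketch. It is not taken from the spec's Appendix A reader or from any CPAN module; `tokenize_rtf` and its token layout are made up for illustration. It assumes the whole document is already in a string, treats escaped characters such as `\{` and `\\` as plain control symbols, and ignores `\bin` data entirely.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of any CPAN module): split an RTF string
# into a flat list of [type, value, parameter] tokens.
sub tokenize_rtf {
    my ($rtf) = @_;
    my @tokens;
    pos($rtf) = 0;
    while ( pos($rtf) < length $rtf ) {
        if ( $rtf =~ /\G\{/gc ) {                      # group opens
            push @tokens, [ 'group', 1 ];
        }
        elsif ( $rtf =~ /\G\}/gc ) {                   # group closes
            push @tokens, [ 'group', 0 ];
        }
        elsif ( $rtf =~ /\G\\([a-zA-Z]+)(-?\d+)?[ ]?/gc ) {
            # control word, optional numeric parameter, and one
            # optional space that acts as a delimiter
            push @tokens, [ 'control', $1, $2 ];
        }
        elsif ( $rtf =~ /\G\\'([0-9a-fA-F]{2})/gc ) {  # \'hh hex escape
            push @tokens, [ 'control', "'", $1 ];
        }
        elsif ( $rtf =~ /\G\\([^a-zA-Z])/gcs ) {       # control symbol, e.g. \~
            push @tokens, [ 'control', $1 ];
        }
        elsif ( $rtf =~ /\G([^\\{}]+)/gc ) {           # run of plain text
            ( my $text = $1 ) =~ tr/\r\n//d;           # bare newlines carry no meaning
            push @tokens, [ 'text', $text ] if length $text;
        }
        else {
            die "Cannot tokenize at offset " . pos($rtf) . "\n";
        }
    }
    return @tokens;
}

# Usage: dump the tokens of a tiny document.
for my $token ( tokenize_rtf('{\rtf1\ansi Hello\~world\par}') ) {
    print join( ' | ', map { $_ // '' } @$token ), "\n";
}
```

Turning the flat token list into nested groups is then a matter of keeping a stack that is pushed on each opening brace and popped on each closing one. Deciding what every control word means is where the real work would be.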
Anybody looking for a project this smells like a good one.
update:
If you're looking for strategy, try looking at a LaTeX parser, because LaTeX and RTF look very similar, if you ask me.
I'm surprised there isn't an open-source library already out there to do this (I know there is a non-free one that looks like it'd be useful).
____________________________________________________ ** The Third rule of perl club is a statement of fact: pod is sexy.
|
|
I'd revise your statement:
> Anybody looking for a project this smells like a good one
to:
Anybody looking for a project, this smells.
RTF is an unpleasant format. The basic spec might be okay for creating a writer, but creating a reader that will handle arbitrary RTF is quite hard.
Thus speaks someone who has just spent the last five months dealing with RTF parsing.
--
$,="\n";foreach(split('',"\3\3\3c>\0>c\177cc\0~c~``\0cc\177cc"))
{$a++;$_=unpack('B8',$_);tr,01,\40#,;$b[$a%6].=$_};print@b,"\n"
|
|
Thank you for a very interesting link; I'll save it for future use.
As it looks like a ready-to-use grammar does not currently exist, I'll try writing one myself and will post it on this site.
However, I expect it to be extremely slow at parsing.
I'll let you know about my further results.
Courage, the Cowardly Dog
Re: Parse::RecDescent grammar for RTF
by graff (Chancellor) on Nov 02, 2002 at 13:48 UTC
|
Having installed RTF::Parser on my Linux laptop just now,
and seeing the README page
for that module, it doesn't look like it's in a usable state
-- e.g. there doesn't appear to be any documentation for how
to use it. (And there wasn't any explicit mention of how
old it is -- but the downloaded files under .cpan/build
date from July 1999.) This looks like a dead-end, orphaned
module. (Should CPAN have something like a garbage-collection
process to clear away stuff like this?)
On the other hand, RTF::Tokenizer seems to be quite current,
and is documented. You didn't mention what you need to do,
but maybe this module will be able to help you out.
|
|
I've chatted with Pete Sergeant; I don't think that RTF::Tokenizer will be developed further. It is useful, though, if you are careful to preprocess special characters before running the tokenizer. It really, really doesn't like \~ codes for non-break spaces.
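For anyone wondering what that preprocessing might look like, here is a minimal sketch for the `\~` case only. `protect_nbsp` is a made-up name, and the replacement is an assumption: it rewrites `\~` as the hex escape `\'a0`, which is a non-breaking space only if the document uses a code page where 0xA0 means that (Windows-1252, for example) and only if whatever reads the output accepts hex escapes. It does not try to skip raw `\bin` data.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite the \~ control symbol before tokenizing.
# ASSUMPTION: \'a0 is an acceptable stand-in for a non-breaking space
# in the document's code page and for the downstream tool.
sub protect_nbsp {
    my ($rtf) = @_;
    # Alternation order matters: an escaped backslash (\\) is consumed
    # first and put back unchanged, so the "\~" inside "\\~" (a literal
    # backslash followed by a tilde) is left alone.
    $rtf =~ s/(\\\\)|\\~/ defined $1 ? $1 : "\\'a0" /ge;
    return $rtf;
}

# Usage: clean the text first, then hand it to the tokenizer.
my $clean = protect_nbsp('{\rtf1\ansi 10\~km}');
print "$clean\n";
```

The same substitution could be extended with further alternatives for other symbols that cause trouble, as long as the escaped-backslash branch stays first.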
--
$,="\n";foreach(split('',"\3\3\3c>\0>c\177cc\0~c~``\0cc\177cc"))
{$a++;$_=unpack('B8',$_);tr,01,\40#,;$b[$a%6].=$_};print@b,"\n"